
[2/2] kvm: powerpc: set cache coherency only for kernel managed pages

Message ID 20130725010438.GE10517@redhat.com (mailing list archive)
State New, archived

Commit Message

Andrea Arcangeli July 25, 2013, 1:04 a.m. UTC
Hi!

On Wed, Jul 24, 2013 at 01:30:12PM +0300, Gleb Natapov wrote:
> On Wed, Jul 24, 2013 at 12:25:18PM +0200, Alexander Graf wrote:
> > 
> > On 24.07.2013, at 12:19, Gleb Natapov wrote:
> > 
> > > On Wed, Jul 24, 2013 at 12:09:42PM +0200, Alexander Graf wrote:
> > >> 
> > >> On 24.07.2013, at 12:01, Gleb Natapov wrote:
> > >> 
> > >>> Copying Andrea for him to verify that I am not talking nonsense :)
> > >>> 
> > >>> On Wed, Jul 24, 2013 at 10:25:20AM +0200, Alexander Graf wrote:
> > >>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > >>>>> index 1580dd4..5e8635b 100644
> > >>>>> --- a/virt/kvm/kvm_main.c
> > >>>>> +++ b/virt/kvm/kvm_main.c
> > >>>>> @@ -102,6 +102,10 @@ static bool largepages_enabled = true;
> > >>>>> 
> > >>>>> bool kvm_is_mmio_pfn(pfn_t pfn)
> > >>>>> {
> > >>>>> +#ifdef CONFIG_MEMORY_HOTPLUG
> > >>>> 
> > >>>> I'd feel safer if we narrow this down to e500.
> > >>>> 
> > >>>>> +       /*
> > >>>>> +        * Currently only in memory hot remove case we may still need this.
> > >>>>> +        */
> > >>>>>      if (pfn_valid(pfn)) {
> > >>>> 
> > >>>> We still have to check for pfn_valid, no? So the #ifdef should be down here.
> > >>>> 
> > >>>>>              int reserved;
> > >>>>>              struct page *tail = pfn_to_page(pfn);
> > >>>>> @@ -124,6 +128,7 @@ bool kvm_is_mmio_pfn(pfn_t pfn)
> > >>>>>              }
> > >>>>>              return PageReserved(tail);
> > >>>>>      }
> > >>>>> +#endif
> > >>>>> 
> > >>>>>      return true;
> > >>>>> }
> > >>>>> 
> > >>>>> Before applying this change:
> > >>>>> 
> > >>>>> real    (1m19.954s + 1m20.918s + 1m22.740s + 1m21.146s + 1m22.120s)/5= 1m21.376s
> > >>>>> user    (0m23.181s + 0m23.550s + 0m23.506s + 0m23.410s + 0m23.520s)/5= 0m23.433s
> > >>>>> sys	(0m49.087s + 0m49.563s + 0m51.758s + 0m50.290s + 0m51.047s)/5= 0m50.349s
> > >>>>> 
> > >>>>> After applying this change:
> > >>>>> 
> > >>>>> real    (1m19.507s + 1m20.919s + 1m21.436s + 1m21.179s + 1m20.293s)/5= 1m20.667s
> > >>>>> user    (0m22.595s + 0m22.719s + 0m22.484s + 0m22.811s + 0m22.467s)/5= 0m22.615s
> > >>>>> sys	(0m48.841s + 0m49.929s + 0m50.310s + 0m49.813s + 0m48.587s)/5= 0m49.496s
> > >>>>> 
> > >>>>> So,
> > >>>>> 
> > >>>>> real    (1m20.667s - 1m21.376s)/1m21.376s x 100% = -0.9%
> > >>>>> user    (0m22.615s - 0m23.433s)/0m23.433s x 100% = -3.5%
> > >>>>> sys	(0m49.496s - 0m50.349s)/0m50.349s x 100% = -1.7%
> > >>>> 
> > >>>> Very nice, so there is a real world performance benefit to doing this. Then yes, I think it would make sense to change the global helper function to be fast on e500 and use that one from e500_shadow_mas2_attrib() instead.
> > >>>> 
> > >>>> Gleb, Paolo, any hard feelings?
> > >>>> 
> > >>> I do not see how we can break the function in such a way and get
> > >>> away with it. Not all valid pfns point to memory. The physical address
> > >>> space can be sparse (due to a PCI hole, a framebuffer, or just because).
> > >> 
> > >> But we don't check for sparseness today in here either. We merely check for incomplete huge pages.
> > >> 
> > > That's not how I read the code. The code checks whether the reserved
> > > flag is set. It should be set on pfns that point to memory holes. As far as I
> > 
> > I couldn't find any trace of code that sets the reserved bit on e500 chips though. I've only seen it getting set by memory hotplug and by incoherent-DMA memory code, which doesn't get used on e500.
> > 
> > But I'd be more than happy to get proven wrong :).
> > 
> Can you write a module that scans all page structures? AFAIK all pages
> are marked as reserved and then those that become regular memory are
> marked as unreserved. Hope Andrea will chime in here :)

So the situation with regard to non-RAM and PageReserved/pfn_valid is
quite simple.

"struct page" exists for non-RAM too as "struct page" must exist up to
at least 2^MAX_ORDER pfn alignment or things breaks, like the first
pfn must be 2^MXA_ORDER aligned or again things break in the buddy. We
don't make an effort to save a few "struct page" to keep it simpler.

But those non-RAM pages (or tiny non-RAM page holes if any) are marked
PageReserved.
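
(A scanner module like the one Gleb suggests would confirm this; a
minimal untested sketch, assuming the arch exports max_pfn:)

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/page-flags.h>
#include <linux/bootmem.h>	/* max_pfn */

static int __init resv_scan_init(void)
{
	unsigned long pfn, valid = 0, reserved = 0;

	for (pfn = 0; pfn < max_pfn; pfn++) {
		if (!pfn_valid(pfn))
			continue;	/* no struct page at all */
		valid++;
		if (PageReserved(pfn_to_page(pfn)))
			reserved++;	/* holes, bootmem, driver pages */
	}
	pr_info("resv_scan: %lu valid pfns, %lu reserved\n",
		valid, reserved);
	return 0;
}

static void __exit resv_scan_exit(void)
{
}

module_init(resv_scan_init);
module_exit(resv_scan_exit);
MODULE_LICENSE("GPL");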

If "struct page" doesn't exist pfn_valid returns false.

So you cannot get away with skipping pfn_valid and at least one
PageReserved check.
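
In other words any safe check has to be ordered like this (hypothetical
helper, just to illustrate; the name is made up):

static bool pfn_is_refcountable(unsigned long pfn)
{
	/* never call pfn_to_page on a pfn without a struct page */
	if (!pfn_valid(pfn))
		return false;
	/* non-RAM holes and driver-owned pages are marked PageReserved */
	return !PageReserved(pfn_to_page(pfn));
}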

However it gets more complex than just RAM vs non-RAM, because there
are pages that are real RAM (not left marked PageReserved at boot
after checking e820 or the equivalent BIOS data on non-x86 archs) but
that are taken over by drivers, which could then use them as mmio
regions, snooping the writes and mapping them in userland too, maybe
even as hugepages. That is the motivation for the THP-related code in
kvm_is_mmio_pfn.

Those vmas have VM_PFNMAP set, so vm_normal_page returns NULL and the
refcounting is skipped as if it were non-RAM; they're mapped with
remap_pfn_range (VM_MIXEDMAP is a different mechanism that does the
refcounting and in turn doesn't require the driver to mark the page
PageReserved).
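
The typical pattern in such a driver looks like this (hypothetical
mydrv_mmap; mydrv_phys is an assumed physical base address of the RAM
the driver claimed):

static phys_addr_t mydrv_phys;	/* set at probe time (assumption) */

static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
{
	/*
	 * remap_pfn_range sets VM_IO|VM_PFNMAP on the vma, so
	 * vm_normal_page() returns NULL for these ptes and no page
	 * reference is ever taken on the backing frames.
	 */
	return remap_pfn_range(vma, vma->vm_start,
			       mydrv_phys >> PAGE_SHIFT,
			       vma->vm_end - vma->vm_start,
			       vma->vm_page_prot);
}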

The above explains why KVM needs to skip the refcounting when
PageReserved == true && pfn_valid() == true, and why for
pfn_valid == false it must skip the refcounting without even calling
pfn_to_page (or it'll crash).
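
That's what the KVM release path already does, roughly (quoting from
memory, so the details may be slightly off):

void kvm_release_pfn_clean(pfn_t pfn)
{
	/* only refcounted (regular RAM) pages had a reference taken */
	if (!is_error_noslot_pfn(pfn) && !kvm_is_mmio_pfn(pfn))
		put_page(pfn_to_page(pfn));
}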

Now the code doing the THP check with smp_rmb is very safe, possibly
too safe. Looking at it now, it looks like a minor overengineering
oversight.

The slight oversight is that split_huge_page cannot transfer the
PG_reserved bit from head to tail.
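
__split_huge_page_refcount only copies a small whitelist of flags from
the head to each tail, along these lines (quoting mm/huge_memory.c
from memory):

	page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
	page_tail->flags |= (page->flags &
			     ((1L << PG_referenced) |
			      (1L << PG_swapbacked) |
			      (1L << PG_mlocked) |
			      (1L << PG_uptodate)));
	/* note PG_reserved is not in the mask: it stays on the head */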

So the check buys nothing even in the scenario it guards against: a
driver allocates a hugepage, marks the head reserved (the PG_ bits of
a THP page are only relevant in the head), maps the page with some new
version of remap_pfn_range_huge (not possible right now: PFNMAP and
MIXEDMAP can only handle 4k mappings), and then split_huge_page runs.
We'd miss the reserved bit on the tail page regardless, because
split_huge_page wouldn't transfer it from head to tail in the first
place.

Besides, split_huge_page couldn't run on a device-owned page in the
first place: it's not anonymous, and there's currently no way to map
it with a huge pmd anyway.

So in short, having such a check there is probably never going to
help. We can probably optimize away the THP code in there.

No matter how a driver maps this hypothetical new type of reserved
hugepage in userland, it should never allow split_huge_page to run on
it, and it should take care of marking all subpages as reserved
too. Then KVM won't need to worry about a driver setting reserved only
on the head page anymore.
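
At setup time such a hypothetical driver would just do something like
this (page being the compound head it allocated):

	int i;

	/* reserve every subpage, not just the THP head */
	for (i = 0; i < HPAGE_PMD_NR; i++)
		SetPageReserved(page + i);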

Untested RFC patch follows.

==
From 76927680df7034a575bed5da754f7ebe94481fb3 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Thu, 25 Jul 2013 02:56:08 +0200
Subject: [PATCH] kvm: optimize away THP checks in kvm_is_mmio_pfn()

The checks on PG_reserved in the page structure on head and tail pages
aren't necessary because split_huge_page wouldn't transfer the
PG_reserved bit from head to tail anyway.

This was a forward-thinking check for the case where PageReserved was
set on a driver-owned page mapped in userland with something like
remap_pfn_range in a VM_PFNMAP region, but using huge pmds (not
possible right now). It was meant to be very safe, but it's overkill,
as it's unlikely split_huge_page could ever run without the driver
noticing and tearing down the hugepage itself.

And if a driver in the future really wants to map a reserved hugepage
in userland using a huge pmd, it should simply take care of marking
all subpages reserved too, to keep KVM safe. This of course would
require such a hypothetical driver to tear down the huge pmd and
split the hugepage itself, instead of relying on split_huge_page, but
that sounds very reasonable, especially considering split_huge_page
wouldn't currently transfer the reserved bit anyway.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 virt/kvm/kvm_main.c | 24 ++----------------------
 1 file changed, 2 insertions(+), 22 deletions(-)


Patch

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1580dd4..fa030fb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -102,28 +102,8 @@  static bool largepages_enabled = true;
 
 bool kvm_is_mmio_pfn(pfn_t pfn)
 {
-	if (pfn_valid(pfn)) {
-		int reserved;
-		struct page *tail = pfn_to_page(pfn);
-		struct page *head = compound_trans_head(tail);
-		reserved = PageReserved(head);
-		if (head != tail) {
-			/*
-			 * "head" is not a dangling pointer
-			 * (compound_trans_head takes care of that)
-			 * but the hugepage may have been splitted
-			 * from under us (and we may not hold a
-			 * reference count on the head page so it can
-			 * be reused before we run PageReferenced), so
-			 * we've to check PageTail before returning
-			 * what we just read.
-			 */
-			smp_rmb();
-			if (PageTail(tail))
-				return reserved;
-		}
-		return PageReserved(tail);
-	}
+	if (pfn_valid(pfn))
+		return PageReserved(pfn_to_page(pfn));
 
 	return true;
 }