mbox series

[v2,0/2] mm: Memory offlining + page isolation cleanups

Message ID 20191021172353.3056-1-david@redhat.com (mailing list archive)
Headers show
Series mm: Memory offlining + page isolation cleanups | expand

Message

David Hildenbrand Oct. 21, 2019, 5:23 p.m. UTC
Two cleanups that popped up while working on (and discussing) virtio-mem:
 https://lkml.org/lkml/2019/9/19/463

Tested with DIMMs on x86.

As discussed with michal in v1, I'll soon look into removing the use
of PG_reserved during memory onlining completely - most probably
disallowing to offline memory blocks with holes, cleaning up the
onlining+offlining code.

v1 -> v2:
- "mm/page_alloc.c: Don't set pages PageReserved() when offlining"
-- Fixup one comment
- "mm/page_isolation.c: Convert SKIP_HWPOISON to MEMORY_OFFLINE"
-- Use parenthesis around checks
- Added ACKs

David Hildenbrand (2):
  mm/page_alloc.c: Don't set pages PageReserved() when offlining
  mm/page_isolation.c: Convert SKIP_HWPOISON to MEMORY_OFFLINE

 include/linux/page-isolation.h |  4 ++--
 mm/memory_hotplug.c            | 12 ++++++------
 mm/page_alloc.c                |  9 +++------
 mm/page_isolation.c            | 12 ++++++------
 4 files changed, 17 insertions(+), 20 deletions(-)

Comments

David Hildenbrand Oct. 22, 2019, 6:52 a.m. UTC | #1
On 21.10.19 19:23, David Hildenbrand wrote:
> Two cleanups that popped up while working on (and discussing) virtio-mem:
>   https://lkml.org/lkml/2019/9/19/463
> 
> Tested with DIMMs on x86.
> 
> As discussed with michal in v1, I'll soon look into removing the use
> of PG_reserved during memory onlining completely - most probably
> disallowing to offline memory blocks with holes, cleaning up the
> onlining+offlining code.

BTW, I remember that ZONE_DEVICE pages are still required to be set 
PG_reserved. That has to be sorted out first. I remember that somebody 
was working on it a while ago but didn't hear about that again. Will 
look into that as well - should be as easy as adding a zone check (if 
there isn't a pfn_to_online_page() check already). But of course, there 
might be special cases ....
Michal Hocko Oct. 22, 2019, 8:08 a.m. UTC | #2
On Tue 22-10-19 08:52:28, David Hildenbrand wrote:
> On 21.10.19 19:23, David Hildenbrand wrote:
> > Two cleanups that popped up while working on (and discussing) virtio-mem:
> >   https://lkml.org/lkml/2019/9/19/463
> > 
> > Tested with DIMMs on x86.
> > 
> > As discussed with michal in v1, I'll soon look into removing the use
> > of PG_reserved during memory onlining completely - most probably
> > disallowing to offline memory blocks with holes, cleaning up the
> > onlining+offlining code.
> 
> BTW, I remember that ZONE_DEVICE pages are still required to be set
> PG_reserved. That has to be sorted out first.

Do they?

> I remember that somebody was
> working on it a while ago but didn't hear about that again. Will look into
> that as well - should be as easy as adding a zone check (if there isn't a
> pfn_to_online_page() check already). But of course, there might be special
> cases ....

I remember Alexander didn't want to change the PageReserved handling
because he was worried about unforeseeable side effects. I have a vague
recollection he (or maybe Dan) has promissed some follow up clean ups
which didn't seem to materialize.
David Hildenbrand Oct. 22, 2019, 8:15 a.m. UTC | #3
On 22.10.19 10:08, Michal Hocko wrote:
> On Tue 22-10-19 08:52:28, David Hildenbrand wrote:
>> On 21.10.19 19:23, David Hildenbrand wrote:
>>> Two cleanups that popped up while working on (and discussing) virtio-mem:
>>>    https://lkml.org/lkml/2019/9/19/463
>>>
>>> Tested with DIMMs on x86.
>>>
>>> As discussed with michal in v1, I'll soon look into removing the use
>>> of PG_reserved during memory onlining completely - most probably
>>> disallowing to offline memory blocks with holes, cleaning up the
>>> onlining+offlining code.
>>
>> BTW, I remember that ZONE_DEVICE pages are still required to be set
>> PG_reserved. That has to be sorted out first.
> 
> Do they?

Yes, especially KVM code :/

> 
>> I remember that somebody was
>> working on it a while ago but didn't hear about that again. Will look into
>> that as well - should be as easy as adding a zone check (if there isn't a
>> pfn_to_online_page() check already). But of course, there might be special
>> cases ....
> 
> I remember Alexander didn't want to change the PageReserved handling
> because he was worried about unforeseeable side effects. I have a vague
> recollection he (or maybe Dan) has promissed some follow up clean ups
> which didn't seem to materialize.

I'm looking into it right now, especially the KVM part.
Michal Hocko Oct. 22, 2019, 8:21 a.m. UTC | #4
On Tue 22-10-19 10:15:07, David Hildenbrand wrote:
> On 22.10.19 10:08, Michal Hocko wrote:
> > On Tue 22-10-19 08:52:28, David Hildenbrand wrote:
> > > On 21.10.19 19:23, David Hildenbrand wrote:
> > > > Two cleanups that popped up while working on (and discussing) virtio-mem:
> > > >    https://lkml.org/lkml/2019/9/19/463
> > > > 
> > > > Tested with DIMMs on x86.
> > > > 
> > > > As discussed with michal in v1, I'll soon look into removing the use
> > > > of PG_reserved during memory onlining completely - most probably
> > > > disallowing to offline memory blocks with holes, cleaning up the
> > > > onlining+offlining code.
> > > 
> > > BTW, I remember that ZONE_DEVICE pages are still required to be set
> > > PG_reserved. That has to be sorted out first.
> > 
> > Do they?
> 
> Yes, especially KVM code :/

Details please?
David Hildenbrand Oct. 22, 2019, 8:32 a.m. UTC | #5
On 22.10.19 10:21, Michal Hocko wrote:
> On Tue 22-10-19 10:15:07, David Hildenbrand wrote:
>> On 22.10.19 10:08, Michal Hocko wrote:
>>> On Tue 22-10-19 08:52:28, David Hildenbrand wrote:
>>>> On 21.10.19 19:23, David Hildenbrand wrote:
>>>>> Two cleanups that popped up while working on (and discussing) virtio-mem:
>>>>>     https://lkml.org/lkml/2019/9/19/463
>>>>>
>>>>> Tested with DIMMs on x86.
>>>>>
>>>>> As discussed with michal in v1, I'll soon look into removing the use
>>>>> of PG_reserved during memory onlining completely - most probably
>>>>> disallowing to offline memory blocks with holes, cleaning up the
>>>>> onlining+offlining code.
>>>>
>>>> BTW, I remember that ZONE_DEVICE pages are still required to be set
>>>> PG_reserved. That has to be sorted out first.
>>>
>>> Do they?
>>
>> Yes, especially KVM code :/
> 
> Details please?
> 


E.g., arch/x86/kvm/mmu.c:kvm_is_mmio_pfn()

And I currently have


From 55606751b67989bd06d17844a6bcfbf85d44ee69 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Tue, 22 Oct 2019 10:08:04 +0200
Subject: [PATCH v1] KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved
 changes

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that in the future.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap - however, there is no reliable and
fast check to detect memmaps that were initialized and are ZONE_DEVICE.

Let's rewrite kvm_is_mmio_pfn() so we really only touch initialized
memmaps that are guaranteed to not contain garbage. Make sure that
RAM without a memmap is still not detected as MMIO and that ZONE_DEVICE
that is not UC/UC-/WC is not detected as MMIO.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/kvm/mmu.c | 38 ++++++++++++++++++++++----------------
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24c23c66b226..c91c9a5d14dc 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2962,23 +2962,29 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 
 static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 {
-	if (pfn_valid(pfn))
-		return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
-			/*
-			 * Some reserved pages, such as those from NVDIMM
-			 * DAX devices, are not for MMIO, and can be mapped
-			 * with cached memory type for better performance.
-			 * However, the above check misconceives those pages
-			 * as MMIO, and results in KVM mapping them with UC
-			 * memory type, which would hurt the performance.
-			 * Therefore, we check the host memory type in addition
-			 * and only treat UC/UC-/WC pages as MMIO.
-			 */
-			(!pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn));
+	struct page *page = pfn_to_online_page(pfn);
+
+	/*
+	 * Online pages consist of pages managed by the buddy. Especially,
+	 * ZONE_DEVICE pages are never online. Online pages that are reserved
+	 * indicate the zero page and MMIO pages.
+	 */
+	if (page)
+		return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
 
-	return !e820__mapped_raw_any(pfn_to_hpa(pfn),
-				     pfn_to_hpa(pfn + 1) - 1,
-				     E820_TYPE_RAM);
+	/*
+	 * Any RAM that is not online (e.g., mapped via /dev/mem without
+	 * a memmap or with an uninitialized memmap) is not MMIO.
+	 */
+	if (e820__mapped_raw_any(pfn_to_hpa(pfn), pfn_to_hpa(pfn + 1) - 1,
+				 E820_TYPE_RAM))
+		return false;
+
+	/*
+	 * Finally, anything with a valid memmap could be ZONE_DEVICE - or the
+	 * memmap could be uninitialized. Treat only UC/UC-/WC pages as MMIO.
+	 */
+	return pfn_valid() && !pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn);
 }
 
 /* Bits which may be returned by set_spte() */
David Hildenbrand Oct. 22, 2019, 8:38 a.m. UTC | #6
On 22.10.19 10:32, David Hildenbrand wrote:
> On 22.10.19 10:21, Michal Hocko wrote:
>> On Tue 22-10-19 10:15:07, David Hildenbrand wrote:
>>> On 22.10.19 10:08, Michal Hocko wrote:
>>>> On Tue 22-10-19 08:52:28, David Hildenbrand wrote:
>>>>> On 21.10.19 19:23, David Hildenbrand wrote:
>>>>>> Two cleanups that popped up while working on (and discussing) virtio-mem:
>>>>>>      https://lkml.org/lkml/2019/9/19/463
>>>>>>
>>>>>> Tested with DIMMs on x86.
>>>>>>
>>>>>> As discussed with michal in v1, I'll soon look into removing the use
>>>>>> of PG_reserved during memory onlining completely - most probably
>>>>>> disallowing to offline memory blocks with holes, cleaning up the
>>>>>> onlining+offlining code.
>>>>>
>>>>> BTW, I remember that ZONE_DEVICE pages are still required to be set
>>>>> PG_reserved. That has to be sorted out first.
>>>>
>>>> Do they?
>>>
>>> Yes, especially KVM code :/
>>
>> Details please?
>>

Oh, and I think you might be wondering "how can we have RAM without a 
memmap in the guest", see

https://lwn.net/Articles/778240/
Michal Hocko Oct. 22, 2019, 9:14 a.m. UTC | #7
On Tue 22-10-19 10:32:11, David Hildenbrand wrote:
[...]
> E.g., arch/x86/kvm/mmu.c:kvm_is_mmio_pfn()

Thanks for these references. I am not really familiar with kvm so I
cannot really comment on the specific code but I am wondering why
it simply doesn't check for ZONE_DEVICE explicitly? Also we do care
about holes in RAM (from the early boot), those should be reserved
already AFAIR. So we are left with hotplugged memory with holes and
I am not really sure we should bother with this until there is a clear
usecase in sight.
David Hildenbrand Oct. 22, 2019, 9:17 a.m. UTC | #8
On 22.10.19 11:14, Michal Hocko wrote:
> On Tue 22-10-19 10:32:11, David Hildenbrand wrote:
> [...]
>> E.g., arch/x86/kvm/mmu.c:kvm_is_mmio_pfn()
> 
> Thanks for these references. I am not really familiar with kvm so I
> cannot really comment on the specific code but I am wondering why
> it simply doesn't check for ZONE_DEVICE explicitly? Also we do care
> about holes in RAM (from the early boot), those should be reserved
> already AFAIR. So we are left with hotplugged memory with holes and
> I am not really sure we should bother with this until there is a clear
> usecase in sight.

Well, checking for ZONE_DEVICE is only possible if you have an 
initialized memmap. And that is not guaranteed when you start mapping 
random stuff into your guest via /dev/mem.

I am reworking these patches right now and audit the whole kernel for 
PageReserved() checks that might affect ZONE_DEVICE. I'll send the 
collection of patches as RFC.
Michal Hocko Oct. 22, 2019, 9:24 a.m. UTC | #9
On Tue 22-10-19 11:17:24, David Hildenbrand wrote:
> On 22.10.19 11:14, Michal Hocko wrote:
> > On Tue 22-10-19 10:32:11, David Hildenbrand wrote:
> > [...]
> > > E.g., arch/x86/kvm/mmu.c:kvm_is_mmio_pfn()
> > 
> > Thanks for these references. I am not really familiar with kvm so I
> > cannot really comment on the specific code but I am wondering why
> > it simply doesn't check for ZONE_DEVICE explicitly? Also we do care
> > about holes in RAM (from the early boot), those should be reserved
> > already AFAIR. So we are left with hotplugged memory with holes and
> > I am not really sure we should bother with this until there is a clear
> > usecase in sight.
> 
> Well, checking for ZONE_DEVICE is only possible if you have an initialized
> memmap. And that is not guaranteed when you start mapping random stuff into
> your guest via /dev/mem.

Yes, I can understand that part but checking PageReserved on an
uninitialized memmap is pointless as well. So if you can test for it you
can very well test for ZONE_DEVICE as well. PageReserved -> ZONE_DEVICE
is a terrible assumption.

> I am reworking these patches right now and audit the whole kernel for
> PageReserved() checks that might affect ZONE_DEVICE. I'll send the
> collection of patches as RFC.

Thanks a lot!
David Hildenbrand Oct. 22, 2019, 9:27 a.m. UTC | #10
On 22.10.19 11:24, Michal Hocko wrote:
> On Tue 22-10-19 11:17:24, David Hildenbrand wrote:
>> On 22.10.19 11:14, Michal Hocko wrote:
>>> On Tue 22-10-19 10:32:11, David Hildenbrand wrote:
>>> [...]
>>>> E.g., arch/x86/kvm/mmu.c:kvm_is_mmio_pfn()
>>>
>>> Thanks for these references. I am not really familiar with kvm so I
>>> cannot really comment on the specific code but I am wondering why
>>> it simply doesn't check for ZONE_DEVICE explicitly? Also we do care
>>> about holes in RAM (from the early boot), those should be reserved
>>> already AFAIR. So we are left with hotplugged memory with holes and
>>> I am not really sure we should bother with this until there is a clear
>>> usecase in sight.
>>
>> Well, checking for ZONE_DEVICE is only possible if you have an initialized
>> memmap. And that is not guaranteed when you start mapping random stuff into
>> your guest via /dev/mem.
> 
> Yes, I can understand that part but checking PageReserved on an
> uninitialized memmap is pointless as well. So if you can test for it you

That's why I add pfn_to_online_page() :)

> can very well test for ZONE_DEVICE as well. PageReserved -> ZONE_DEVICE
> is a terrible assumption.
Indeed, it is. But there are more parts in the kernel that I'll be 
fixing. Stay tuned.