[RFC,v1,00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)

Message ID	20191022171239.21487-1-david@redhat.com (mailing list archive)
Headers	show Return-Path: <SRS0=TltB=YP=lists.xenproject.org=xen-devel-bounces@kernel.org> Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 22D9114ED for <patchwork-xen-devel@patchwork.kernel.org>; Tue, 22 Oct 2019 17:15:05 +0000 (UTC) Received: from lists.xenproject.org (lists.xenproject.org [192.237.175.120]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id E719D20684 for <patchwork-xen-devel@patchwork.kernel.org>; Tue, 22 Oct 2019 17:15:04 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="AKXKGYz4" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org E719D20684 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=xen-devel-bounces@lists.xenproject.org Received: from localhost ([127.0.0.1] helo=lists.xenproject.org) by lists.xenproject.org with esmtp (Exim 4.89) (envelope-from <xen-devel-bounces@lists.xenproject.org>) id 1iMxil-000609-EZ; Tue, 22 Oct 2019 17:13:23 +0000 Received: from us1-rack-iad1.inumbo.com ([172.99.69.81]) by lists.xenproject.org with esmtp (Exim 4.89) (envelope-from <SRS0=I1Ch=YP=redhat.com=david@srs-us1.protection.inumbo.net>) id 1iMxik-000603-Kz for xen-devel@lists.xenproject.org; Tue, 22 Oct 2019 17:13:22 +0000 X-Inumbo-ID: 3e892016-f4ef-11e9-bbab-bc764e2007e4 Received: from us-smtp-1.mimecast.com (unknown [207.211.31.81]) by us1-rack-iad1.inumbo.com (Halon) with ESMTP id 3e892016-f4ef-11e9-bbab-bc764e2007e4; Tue, 22 Oct 2019 17:13:19 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1571764397; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding; bh=lOu6AQP2dyv/4LtgwGlaMv6HFBAdtkUQmm1EyLtCFNM=; b=AKXKGYz40SwBaJnfOTKdjQKuO4UjeTEvm5sphoNdDojZ34kFYKXstB8l1qTF2y2X39X8Mr SBTUwr/STXCIDfsSSdC+6YWCr5eJn+Y3Nj8/wzKGGlzSmh9lApYxlUigcSxWvA5tan1j2y XjtHy4ScFtpo1fc9x1IJLN6bUwkdodc= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-263-9EHfARfbNLK-zZYBJq1aBA-1; Tue, 22 Oct 2019 13:13:15 -0400 Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.phx2.redhat.com [10.5.11.14]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 7918A47B; Tue, 22 Oct 2019 17:13:07 +0000 (UTC) Received: from t460s.redhat.com (ovpn-116-248.ams2.redhat.com [10.36.116.248]) by smtp.corp.redhat.com (Postfix) with ESMTP id 5B8BC5DD79; Tue, 22 Oct 2019 17:12:40 +0000 (UTC) From: David Hildenbrand <david@redhat.com> To: linux-kernel@vger.kernel.org Date: Tue, 22 Oct 2019 19:12:27 +0200 Message-Id: <20191022171239.21487-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.14 X-MC-Unique: 9EHfARfbNLK-zZYBJq1aBA-1 X-Mimecast-Spam-Score: 0 Subject: [Xen-devel] [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) X-BeenThere: xen-devel@lists.xenproject.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Xen developer discussion <xen-devel.lists.xenproject.org> List-Unsubscribe: <https://lists.xenproject.org/mailman/options/xen-devel>, <mailto:xen-devel-request@lists.xenproject.org?subject=unsubscribe> List-Post: <mailto:xen-devel@lists.xenproject.org> List-Help: <mailto:xen-devel-request@lists.xenproject.org?subject=help> List-Subscribe: <https://lists.xenproject.org/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xenproject.org?subject=subscribe> Cc: Kate Stewart <kstewart@linuxfoundation.org>, Sasha Levin <sashal@kernel.org>, linux-hyperv@vger.kernel.org, Michal Hocko <mhocko@suse.com>, =?utf-8?b?UmFkaW0gS3LEjW3DocWZ?= <rkrcmar@redhat.com>, kvm@vger.kernel.org, David Hildenbrand <david@redhat.com>, KarimAllah Ahmed <karahmed@amazon.de>, Benjamin Herrenschmidt <benh@kernel.crashing.org>, Dave Hansen <dave.hansen@linux.intel.com>, Alexander Duyck <alexander.duyck@gmail.com>, Michal Hocko <mhocko@kernel.org>, Paul Mackerras <paulus@ozlabs.org>, linux-mm@kvack.org, Pavel Tatashin <pavel.tatashin@microsoft.com>, Paul Mackerras <paulus@samba.org>, Michael Ellerman <mpe@ellerman.id.au>, "H. Peter Anvin" <hpa@zytor.com>, Wanpeng Li <wanpengli@tencent.com>, Alexander Duyck <alexander.h.duyck@linux.intel.com>, "K. Y. Srinivasan" <kys@microsoft.com>, Fabio Estevam <festevam@gmail.com>, Ben Chan <benchan@chromium.org>, Kees Cook <keescook@chromium.org>, devel@driverdev.osuosl.org, Stefano Stabellini <sstabellini@kernel.org>, Stephen Hemminger <sthemmin@microsoft.com>, "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>, Joerg Roedel <joro@8bytes.org>, x86@kernel.org, YueHaibing <yuehaibing@huawei.com>, Mike Rapoport <rppt@linux.ibm.com>, Madhumitha Prabakaran <madhumithabiw@gmail.com>, Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@redhat.com>, Vlastimil Babka <vbabka@suse.cz>, Nishka Dasgupta <nishkadg.linux@gmail.com>, Anthony Yznaga <anthony.yznaga@oracle.com>, Oscar Salvador <osalvador@suse.de>, Dan Carpenter <dan.carpenter@oracle.com>, "Isaac J. Manjarres" <isaacm@codeaurora.org>, Matt Sickler <Matt.Sickler@daktronics.com>, Juergen Gross <jgross@suse.com>, Anshuman Khandual <anshuman.khandual@arm.com>, Haiyang Zhang <haiyangz@microsoft.com>, =?utf-8?q?Simon_Sandstr=C3=B6m?= <simon@nikanor.nu>, Dan Williams <dan.j.williams@intel.com>, kvm-ppc@vger.kernel.org, Qian Cai <cai@lca.pw>, Alex Williamson <alex.williamson@redhat.com>, Mike Rapoport <rppt@linux.vnet.ibm.com>, Borislav Petkov <bp@alien8.de>, Nicholas Piggin <npiggin@gmail.com>, Andy Lutomirski <luto@kernel.org>, xen-devel@lists.xenproject.org, Boris Ostrovsky <boris.ostrovsky@oracle.com>, Todd Poynor <toddpoynor@google.com>, Vitaly Kuznetsov <vkuznets@redhat.com>, Allison Randal <allison@lohutok.net>, Jim Mattson <jmattson@google.com>, Christophe Leroy <christophe.leroy@c-s.fr>, Vandana BN <bnvandana@gmail.com>, Jeremy Sowden <jeremy@azazel.net>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, Cornelia Huck <cohuck@redhat.com>, Pavel Tatashin <pasha.tatashin@soleen.com>, Mel Gorman <mgorman@techsingularity.net>, Sean Christopherson <sean.j.christopherson@intel.com>, Rob Springer <rspringer@google.com>, Thomas Gleixner <tglx@linutronix.de>, Johannes Weiner <hannes@cmpxchg.org>, Paolo Bonzini <pbonzini@redhat.com>, Andrew Morton <akpm@linux-foundation.org>, linuxppc-dev@lists.ozlabs.org Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: xen-devel-bounces@lists.xenproject.org Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org>
Series	mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) \| expand [RFC,v1,00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) [RFC,v1,01/12] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes [RFC,v1,02/12] mm/usercopy.c: Prepare check_page_span() for PG_reserved changes [RFC,v1,03/12] KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes [RFC,v1,04/12] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes [RFC,v1,05/12] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes [RFC,v1,06/12] staging/gasket: Prepare gasket_release_page() for PG_reserved changes [RFC,v1,07/12] staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes [RFC,v1,08/12] powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() for PG_reserved changes [RFC,v1,09/12] powerpc/64s: Prepare hash_page_do_lazy_icache() for PG_reserved changes [RFC,v1,10/12] powerpc/mm: Prepare maybe_pte_to_page() for PG_reserved changes [RFC,v1,11/12] x86/mm: Prepare __ioremap_check_ram() for PG_reserved changes [RFC,v1,12/12] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap

David Hildenbrand Oct. 22, 2019, 5:12 p.m. UTC

This series is based on [2], which should pop up in linux/next soon:
	https://lkml.org/lkml/2019/10/21/1034

This is the result of a recent discussion with Michal ([1], [2]). Right
now we set all pages PG_reserved when initializing hotplugged memmaps. This
includes ZONE_DEVICE memory. In case of system memory, PG_reserved is
cleared again when onlining the memory, in case of ZONE_DEVICE memory
never. In ancient times, we needed PG_reserved, because there was no way
to tell whether the memmap was already properly initialized. We now have
SECTION_IS_ONLINE for that in the case of !ZONE_DEVICE memory. ZONE_DEVICE
memory is already initialized deferred, and there shouldn't be a visible
change in that regard.

I remember that some time ago, we already talked about stopping to set
ZONE_DEVICE pages PG_reserved on the list, but I never saw any patches.
Also, I forgot who was part of the discussion :)

One of the biggest fear were side effects. I went ahead and audited all
users of PageReserved(). The ones that don't need any care (patches)
can be found below. I will double check and hope I am not missing something
important.

I am probably a little bit too careful (but I don't want to break things).
In most places (besides KVM and vfio that are nuts), the
pfn_to_online_page() check could most probably be avoided by a
is_zone_device_page() check. However, I usually get suspicious when I see
a pfn_valid() check (especially after I learned that people mmap parts of
/dev/mem into user space, including memory without memmaps. Also, people
could memmap offline memory blocks this way :/). As long as this does not
hurt performance, I think we should rather do it the clean way.

I only gave it a quick test with DIMMs on x86-64, but didn't test the
ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
on x86-64 and PPC.

Other users of PageReserved() that should be fine:
- mm/page_owner.c:pagetypeinfo_showmixedcount_print()
  -> Never called for ZONE_DEVICE, (+ pfn_to_online_page(pfn))
- mm/page_owner.c:init_pages_in_zone()
  -> Never called for ZONE_DEVICE (!populated_zone(zone))
- mm/page_ext.c:free_page_ext()
  -> Only a BUG_ON(PageReserved(page)), not relevant
- mm/page_ext.c:has_unmovable_pages()
  -> Not releveant for ZONE_DEVICE
- mm/page_ext.c:pfn_range_valid_contig()
  -> pfn_to_online_page() already guards us
- mm/mempolicy.c:queue_pages_pte_range()
  -> vm_normal_page() checks against pte_devmap()
- mm/memory-failure.c:hwpoison_user_mappings()
  -> Not reached via memory_failure() due to pfn_to_online_page()
  -> Also not reached indirectly via memory_failure_hugetlb()
- mm/hugetlb.c:gather_bootmem_prealloc()
  -> Only a WARN_ON(PageReserved(page)), not relevant
- kernel/power/snapshot.c:saveable_highmem_page()
  -> pfn_to_online_page() already guards us
- kernel/power/snapshot.c:saveable_page()
  -> pfn_to_online_page() already guards us
- fs/proc/task_mmu.c:can_gather_numa_stats()
  -> vm_normal_page() checks against pte_devmap()
- fs/proc/task_mmu.c:can_gather_numa_stats_pmd
  -> vm_normal_page_pmd() checks against pte_devmap()
- fs/proc/page.c:stable_page_flags()
  -> The reserved bit is simply copied, irrelevant
- drivers/firmware/memmap.c:release_firmware_map_entry()
  -> really only a check to detect bootmem. Not relevant for ZONE_DEVICE
- arch/ia64/kernel/mca_drv.c
- arch/mips/mm/init.c
- arch/mips/mm/ioremap.c
- arch/nios2/mm/ioremap.c
- arch/parisc/mm/ioremap.c
- arch/sparc/mm/tlb.c
- arch/xtensa/mm/cache.c
  -> No ZONE_DEVICE support
- arch/powerpc/mm/init_64.c:vmemmap_free()
  -> Special-cases memmap on altmap
  -> Only a check for bootmem
- arch/x86/kernel/alternative.c:__text_poke()
  -> Only a WARN_ON(!PageReserved(pages[0])) to verify it is bootmem
- arch/x86/mm/init_64.c
  -> Only a check for bootmem

[1] https://lkml.org/lkml/2019/10/21/736
[2] https://lkml.org/lkml/2019/10/21/1034

Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: kvm-ppc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: kvm@vger.kernel.org
Cc: linux-hyperv@vger.kernel.org
Cc: devel@driverdev.osuosl.org
Cc: xen-devel@lists.xenproject.org
Cc: x86@kernel.org
Cc: Alexander Duyck <alexander.duyck@gmail.com>

David Hildenbrand (12):
  mm/memory_hotplug: Don't allow to online/offline memory blocks with
    holes
  mm/usercopy.c: Prepare check_page_span() for PG_reserved changes
  KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes
  KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
  staging/gasket: Prepare gasket_release_page() for PG_reserved changes
  staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved
    changes
  powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() for
    PG_reserved changes
  powerpc/64s: Prepare hash_page_do_lazy_icache() for PG_reserved
    changes
  powerpc/mm: Prepare maybe_pte_to_page() for PG_reserved changes
  x86/mm: Prepare __ioremap_check_ram() for PG_reserved changes
  mm/memory_hotplug: Don't mark pages PG_reserved when initializing the
    memmap

 arch/powerpc/kvm/book3s_64_mmu_radix.c     | 14 ++++---
 arch/powerpc/mm/book3s64/hash_utils.c      | 10 +++--
 arch/powerpc/mm/pgtable.c                  | 10 +++--
 arch/x86/kvm/mmu.c                         | 30 +++++++++------
 arch/x86/mm/ioremap.c                      | 13 +++++--
 drivers/hv/hv_balloon.c                    |  6 +++
 drivers/staging/gasket/gasket_page_table.c |  2 +-
 drivers/staging/kpc2000/kpc_dma/fileops.c  |  3 +-
 drivers/vfio/vfio_iommu_type1.c            | 10 ++++-
 drivers/xen/balloon.c                      |  7 ++++
 include/linux/page-flags.h                 |  8 +---
 mm/memory_hotplug.c                        | 43 ++++++++++++++++------
 mm/page_alloc.c                            | 11 ------
 mm/usercopy.c                              |  5 ++-
 virt/kvm/kvm_main.c                        | 10 ++++-
 15 files changed, 115 insertions(+), 67 deletions(-)

Dan Williams Oct. 22, 2019, 9:54 p.m. UTC | #1

Hi David,

Thanks for tackling this!

On Tue, Oct 22, 2019 at 10:13 AM David Hildenbrand <david@redhat.com> wrote:
>
> This series is based on [2], which should pop up in linux/next soon:
>         https://lkml.org/lkml/2019/10/21/1034
>
> This is the result of a recent discussion with Michal ([1], [2]). Right
> now we set all pages PG_reserved when initializing hotplugged memmaps. This
> includes ZONE_DEVICE memory. In case of system memory, PG_reserved is
> cleared again when onlining the memory, in case of ZONE_DEVICE memory
> never. In ancient times, we needed PG_reserved, because there was no way
> to tell whether the memmap was already properly initialized. We now have
> SECTION_IS_ONLINE for that in the case of !ZONE_DEVICE memory. ZONE_DEVICE
> memory is already initialized deferred, and there shouldn't be a visible
> change in that regard.
>
> I remember that some time ago, we already talked about stopping to set
> ZONE_DEVICE pages PG_reserved on the list, but I never saw any patches.
> Also, I forgot who was part of the discussion :)

You got me, Alex, and KVM folks on the Cc, so I'd say that was it.

> One of the biggest fear were side effects. I went ahead and audited all
> users of PageReserved(). The ones that don't need any care (patches)
> can be found below. I will double check and hope I am not missing something
> important.
>
> I am probably a little bit too careful (but I don't want to break things).
> In most places (besides KVM and vfio that are nuts), the
> pfn_to_online_page() check could most probably be avoided by a
> is_zone_device_page() check. However, I usually get suspicious when I see
> a pfn_valid() check (especially after I learned that people mmap parts of
> /dev/mem into user space, including memory without memmaps. Also, people
> could memmap offline memory blocks this way :/). As long as this does not
> hurt performance, I think we should rather do it the clean way.

I'm concerned about using is_zone_device_page() in places that are not
known to already have a reference to the page. Here's an audit of
current usages, and the ones I think need to cleaned up. The "unsafe"
ones do not appear to have any protections against the device page
being removed (get_dev_pagemap()). Yes, some of these were added by
me. The "unsafe? HMM" ones need HMM eyes because HMM leaks device
pages into anonymous memory paths and I'm not up to speed on how it
guarantees 'struct page' validity vs device shutdown without using
get_dev_pagemap().

smaps_pmd_entry(): unsafe

put_devmap_managed_page(): safe, page reference is held

is_device_private_page(): safe? gpu driver manages private page lifetime

is_pci_p2pdma_page(): safe, page reference is held

uncharge_page(): unsafe? HMM

add_to_kill(): safe, protected by get_dev_pagemap() and dax_lock_page()

soft_offline_page(): unsafe

remove_migration_pte(): unsafe? HMM

move_to_new_page(): unsafe? HMM

migrate_vma_pages() and helpers: unsafe? HMM

try_to_unmap_one(): unsafe? HMM

__put_page(): safe

release_pages(): safe

I'm hoping all the HMM ones can be converted to
is_device_private_page() directlly and have that routine grow a nice
comment about how it knows it can always safely de-reference its @page
argument.

For the rest I'd like to propose that we add a facility to determine
ZONE_DEVICE by pfn rather than page. The most straightforward why I
can think of would be to just add another bitmap to mem_section_usage
to indicate if a subsection is ZONE_DEVICE or not.

>
> I only gave it a quick test with DIMMs on x86-64, but didn't test the
> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
> on x86-64 and PPC.

I'll give it a spin, but I don't think the kernel wants to grow more
is_zone_device_page() users.

David Hildenbrand Oct. 23, 2019, 7:26 a.m. UTC | #2

On 22.10.19 23:54, Dan Williams wrote:
> Hi David,
> 
> Thanks for tackling this!

Thanks for having a look :)

[...]


>> I am probably a little bit too careful (but I don't want to break things).
>> In most places (besides KVM and vfio that are nuts), the
>> pfn_to_online_page() check could most probably be avoided by a
>> is_zone_device_page() check. However, I usually get suspicious when I see
>> a pfn_valid() check (especially after I learned that people mmap parts of
>> /dev/mem into user space, including memory without memmaps. Also, people
>> could memmap offline memory blocks this way :/). As long as this does not
>> hurt performance, I think we should rather do it the clean way.
> 
> I'm concerned about using is_zone_device_page() in places that are not
> known to already have a reference to the page. Here's an audit of
> current usages, and the ones I think need to cleaned up. The "unsafe"
> ones do not appear to have any protections against the device page
> being removed (get_dev_pagemap()). Yes, some of these were added by
> me. The "unsafe? HMM" ones need HMM eyes because HMM leaks device
> pages into anonymous memory paths and I'm not up to speed on how it
> guarantees 'struct page' validity vs device shutdown without using
> get_dev_pagemap().
> 
> smaps_pmd_entry(): unsafe
> 
> put_devmap_managed_page(): safe, page reference is held
> 
> is_device_private_page(): safe? gpu driver manages private page lifetime
> 
> is_pci_p2pdma_page(): safe, page reference is held
> 
> uncharge_page(): unsafe? HMM
> 
> add_to_kill(): safe, protected by get_dev_pagemap() and dax_lock_page()
> 
> soft_offline_page(): unsafe
> 
> remove_migration_pte(): unsafe? HMM
> 
> move_to_new_page(): unsafe? HMM
> 
> migrate_vma_pages() and helpers: unsafe? HMM
> 
> try_to_unmap_one(): unsafe? HMM
> 
> __put_page(): safe
> 
> release_pages(): safe
> 
> I'm hoping all the HMM ones can be converted to
> is_device_private_page() directlly and have that routine grow a nice
> comment about how it knows it can always safely de-reference its @page
> argument.
> 
> For the rest I'd like to propose that we add a facility to determine
> ZONE_DEVICE by pfn rather than page. The most straightforward why I
> can think of would be to just add another bitmap to mem_section_usage
> to indicate if a subsection is ZONE_DEVICE or not.

(it's a somewhat unrelated bigger discussion, but we can start discussing it in this thread)

I dislike this for three reasons

a) It does not protect against any races, really, it does not improve things.
b) We do have the exact same problem with pfn_to_online_page(). As long as we
   don't hold the memory hotplug lock, memory can get offlined and remove any time. Racy.
c) We mix in ZONE specific stuff into the core. It should be "just another zone"

What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87)

1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
2. Convert SECTION_IS_ACTIVE to a subsection bitmap
3. Introduce pfn_active() that checks against the subsection bitmap
4. Once the memmap was initialized / prepared, set the subsection active
   (similar to SECTION_IS_ONLINE in the buddy right now)
5. Before the memmap gets invalidated, set the subsection inactive
   (similar to SECTION_IS_ONLINE in the buddy right now)
5. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
6. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE

Especially, driver-reserved device memory will not get set active in
the subsection bitmap.

Now to the race. Taking the memory hotplug lock at random places is ugly. I do
wonder if we can use RCU:

The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page():

	/* the memmap is guaranteed to remain active under RCU */
	rcu_read_lock();
	if (pfn_active(random_pfn)) {
		page = pfn_to_page(random_pfn);
		... use the page, stays valid
	}
	rcu_unread_lock();

Memory offlining/memremap code:

	set_subsections_inactive(pfn, nr_pages); /* clears the bit atomically */
	synchronize_rcu();
	/* all users saw the bitmap update, we can invalide the memmap */
	remove_pfn_range_from_zone(zone, pfn, nr_pages);

> 
>>
>> I only gave it a quick test with DIMMs on x86-64, but didn't test the
>> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
>> on x86-64 and PPC.
> 
> I'll give it a spin, but I don't think the kernel wants to grow more
> is_zone_device_page() users.

Let's recap. In this RFC, I introduce a total of 4 (!) users only.
The other parts can rely on pfn_to_online_page() only.

1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes"
- Basically never used with ZONE_DEVICE.
- We hold a reference!
- All it protects is a SetPageDirty(page);

2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
- Same as 1.

3. "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes"
- We come via virt_to_head_page() / virt_to_head_page(), not sure about 
  references (I assume this should be fine as we don't come via random 
  PFNs)
- We check that we don't mix Reserved (including device memory) and CMA 
  pages when crossing compound pages.

I think we can drop 1. and 2., resulting in a total of 2 new users in
the same context. I think that is totally tolerable to finally clean
this up.


However, I think we also have to clarify if we need the change in 3 at all.
It comes from

commit f5509cc18daa7f82bcc553be70df2117c8eedc16
Author: Kees Cook <keescook@chromium.org>
Date:   Tue Jun 7 11:05:33 2016 -0700

    mm: Hardened usercopy
    
    This is the start of porting PAX_USERCOPY into the mainline kernel. This
    is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
    work is based on code by PaX Team and Brad Spengler, and an earlier port
    from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.
[...]
    - otherwise, object must not span page allocations (excepting Reserved
      and CMA ranges)

Not sure if we really have to care about ZONE_DEVICE at this point.

Dan Williams Oct. 23, 2019, 5:09 p.m. UTC | #3

On Wed, Oct 23, 2019 at 12:26 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 22.10.19 23:54, Dan Williams wrote:
> > Hi David,
> >
> > Thanks for tackling this!
>
> Thanks for having a look :)
>
> [...]
>
>
> >> I am probably a little bit too careful (but I don't want to break things).
> >> In most places (besides KVM and vfio that are nuts), the
> >> pfn_to_online_page() check could most probably be avoided by a
> >> is_zone_device_page() check. However, I usually get suspicious when I see
> >> a pfn_valid() check (especially after I learned that people mmap parts of
> >> /dev/mem into user space, including memory without memmaps. Also, people
> >> could memmap offline memory blocks this way :/). As long as this does not
> >> hurt performance, I think we should rather do it the clean way.
> >
> > I'm concerned about using is_zone_device_page() in places that are not
> > known to already have a reference to the page. Here's an audit of
> > current usages, and the ones I think need to cleaned up. The "unsafe"
> > ones do not appear to have any protections against the device page
> > being removed (get_dev_pagemap()). Yes, some of these were added by
> > me. The "unsafe? HMM" ones need HMM eyes because HMM leaks device
> > pages into anonymous memory paths and I'm not up to speed on how it
> > guarantees 'struct page' validity vs device shutdown without using
> > get_dev_pagemap().
> >
> > smaps_pmd_entry(): unsafe
> >
> > put_devmap_managed_page(): safe, page reference is held
> >
> > is_device_private_page(): safe? gpu driver manages private page lifetime
> >
> > is_pci_p2pdma_page(): safe, page reference is held
> >
> > uncharge_page(): unsafe? HMM
> >
> > add_to_kill(): safe, protected by get_dev_pagemap() and dax_lock_page()
> >
> > soft_offline_page(): unsafe
> >
> > remove_migration_pte(): unsafe? HMM
> >
> > move_to_new_page(): unsafe? HMM
> >
> > migrate_vma_pages() and helpers: unsafe? HMM
> >
> > try_to_unmap_one(): unsafe? HMM
> >
> > __put_page(): safe
> >
> > release_pages(): safe
> >
> > I'm hoping all the HMM ones can be converted to
> > is_device_private_page() directlly and have that routine grow a nice
> > comment about how it knows it can always safely de-reference its @page
> > argument.
> >
> > For the rest I'd like to propose that we add a facility to determine
> > ZONE_DEVICE by pfn rather than page. The most straightforward why I
> > can think of would be to just add another bitmap to mem_section_usage
> > to indicate if a subsection is ZONE_DEVICE or not.
>
> (it's a somewhat unrelated bigger discussion, but we can start discussing it in this thread)
>
> I dislike this for three reasons
>
> a) It does not protect against any races, really, it does not improve things.
> b) We do have the exact same problem with pfn_to_online_page(). As long as we
>    don't hold the memory hotplug lock, memory can get offlined and remove any time. Racy.

True, we need to solve that problem too. That seems to want something
lighter weight than the hotplug lock that can be held over pfn lookups
+  use rather than requiring a page lookup in paths where it's not
clear that a page reference would prevent unplug.

> c) We mix in ZONE specific stuff into the core. It should be "just another zone"

Not sure I grok this when the RFC is sprinkling zone-specific
is_zone_device_page() throughout the core?

>
> What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87)

Sorry I missed this earlier...

>
> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
> 3. Introduce pfn_active() that checks against the subsection bitmap
> 4. Once the memmap was initialized / prepared, set the subsection active
>    (similar to SECTION_IS_ONLINE in the buddy right now)
> 5. Before the memmap gets invalidated, set the subsection inactive
>    (similar to SECTION_IS_ONLINE in the buddy right now)
> 5. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
> 6. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE

This does not seem to reduce any complexity because it still requires
a pfn to zone lookup at the end of the process.

I.e. converting pfn_to_online_page() to use a new pfn_active()
subsection map plus looking up the zone from pfn_to_page() is more
steps than just doing a direct pfn to zone lookup. What am I missing?

>
> Especially, driver-reserved device memory will not get set active in
> the subsection bitmap.
>
> Now to the race. Taking the memory hotplug lock at random places is ugly. I do
> wonder if we can use RCU:

Ah, yes, exactly what I was thinking above.

>
> The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page():
>
>         /* the memmap is guaranteed to remain active under RCU */
>         rcu_read_lock();
>         if (pfn_active(random_pfn)) {
>                 page = pfn_to_page(random_pfn);
>                 ... use the page, stays valid
>         }
>         rcu_unread_lock();
>
> Memory offlining/memremap code:
>
>         set_subsections_inactive(pfn, nr_pages); /* clears the bit atomically */
>         synchronize_rcu();
>         /* all users saw the bitmap update, we can invalide the memmap */
>         remove_pfn_range_from_zone(zone, pfn, nr_pages);

Looks good to me.

>
> >
> >>
> >> I only gave it a quick test with DIMMs on x86-64, but didn't test the
> >> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
> >> on x86-64 and PPC.
> >
> > I'll give it a spin, but I don't think the kernel wants to grow more
> > is_zone_device_page() users.
>
> Let's recap. In this RFC, I introduce a total of 4 (!) users only.
> The other parts can rely on pfn_to_online_page() only.
>
> 1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes"
> - Basically never used with ZONE_DEVICE.
> - We hold a reference!
> - All it protects is a SetPageDirty(page);
>
> 2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
> - Same as 1.
>
> 3. "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes"
> - We come via virt_to_head_page() / virt_to_head_page(), not sure about
>   references (I assume this should be fine as we don't come via random
>   PFNs)
> - We check that we don't mix Reserved (including device memory) and CMA
>   pages when crossing compound pages.
>
> I think we can drop 1. and 2., resulting in a total of 2 new users in
> the same context. I think that is totally tolerable to finally clean
> this up.

...but more is_zone_device_page() doesn't "finally clean this up".
Like we discussed above it's the missing locking that's the real
cleanup, the pfn_to_online_page() internals are secondary.

>
>
> However, I think we also have to clarify if we need the change in 3 at all.
> It comes from
>
> commit f5509cc18daa7f82bcc553be70df2117c8eedc16
> Author: Kees Cook <keescook@chromium.org>
> Date:   Tue Jun 7 11:05:33 2016 -0700
>
>     mm: Hardened usercopy
>
>     This is the start of porting PAX_USERCOPY into the mainline kernel. This
>     is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
>     work is based on code by PaX Team and Brad Spengler, and an earlier port
>     from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.
> [...]
>     - otherwise, object must not span page allocations (excepting Reserved
>       and CMA ranges)
>
> Not sure if we really have to care about ZONE_DEVICE at this point.

That check needs to be careful to ignore ZONE_DEVICE pages. There's
nothing wrong with a copy spanning ZONE_DEVICE and typical pages.

David Hildenbrand Oct. 23, 2019, 5:27 p.m. UTC | #4

>> I dislike this for three reasons
>>
>> a) It does not protect against any races, really, it does not improve things.
>> b) We do have the exact same problem with pfn_to_online_page(). As long as we
>>    don't hold the memory hotplug lock, memory can get offlined and remove any time. Racy.
> 
> True, we need to solve that problem too. That seems to want something
> lighter weight than the hotplug lock that can be held over pfn lookups
> +  use rather than requiring a page lookup in paths where it's not
> clear that a page reference would prevent unplug.
> 
>> c) We mix in ZONE specific stuff into the core. It should be "just another zone"
> 
> Not sure I grok this when the RFC is sprinkling zone-specific
> is_zone_device_page() throughout the core?

Most users should not care about the zone. pfn_active() would be enough
in most situations, especially most PFN walkers - "this memmap is valid
and e.g., contains a valid zone ...".

> 
>>
>> What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87)
> 
> Sorry I missed this earlier...
> 
>>
>> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
>> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
>> 3. Introduce pfn_active() that checks against the subsection bitmap
>> 4. Once the memmap was initialized / prepared, set the subsection active
>>    (similar to SECTION_IS_ONLINE in the buddy right now)
>> 5. Before the memmap gets invalidated, set the subsection inactive
>>    (similar to SECTION_IS_ONLINE in the buddy right now)
>> 5. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
>> 6. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE
> 
> This does not seem to reduce any complexity because it still requires
> a pfn to zone lookup at the end of the process.
> 
> I.e. converting pfn_to_online_page() to use a new pfn_active()
> subsection map plus looking up the zone from pfn_to_page() is more
> steps than just doing a direct pfn to zone lookup. What am I missing?

That a real "pfn to zone" lookup without going via the struct page will
require to have more than just a single bitmap. IMHO, keeping the
information at a single place (memmap) is the clean thing to do (not
replicating it somewhere else). Going via the memmap might not be as
fast as a direct lookup, but do we actually care? We are already looking
at "random PFNs we are not sure if there is a valid memmap".

>>
>> Especially, driver-reserved device memory will not get set active in
>> the subsection bitmap.
>>
>> Now to the race. Taking the memory hotplug lock at random places is ugly. I do
>> wonder if we can use RCU:
> 
> Ah, yes, exactly what I was thinking above.
> 
>>
>> The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page():
>>
>>         /* the memmap is guaranteed to remain active under RCU */
>>         rcu_read_lock();
>>         if (pfn_active(random_pfn)) {
>>                 page = pfn_to_page(random_pfn);
>>                 ... use the page, stays valid
>>         }
>>         rcu_unread_lock();
>>
>> Memory offlining/memremap code:
>>
>>         set_subsections_inactive(pfn, nr_pages); /* clears the bit atomically */
>>         synchronize_rcu();
>>         /* all users saw the bitmap update, we can invalide the memmap */
>>         remove_pfn_range_from_zone(zone, pfn, nr_pages);
> 
> Looks good to me.
> 
>>
>>>
>>>>
>>>> I only gave it a quick test with DIMMs on x86-64, but didn't test the
>>>> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
>>>> on x86-64 and PPC.
>>>
>>> I'll give it a spin, but I don't think the kernel wants to grow more
>>> is_zone_device_page() users.
>>
>> Let's recap. In this RFC, I introduce a total of 4 (!) users only.
>> The other parts can rely on pfn_to_online_page() only.
>>
>> 1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes"
>> - Basically never used with ZONE_DEVICE.
>> - We hold a reference!
>> - All it protects is a SetPageDirty(page);
>>
>> 2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
>> - Same as 1.
>>
>> 3. "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes"
>> - We come via virt_to_head_page() / virt_to_head_page(), not sure about
>>   references (I assume this should be fine as we don't come via random
>>   PFNs)
>> - We check that we don't mix Reserved (including device memory) and CMA
>>   pages when crossing compound pages.
>>
>> I think we can drop 1. and 2., resulting in a total of 2 new users in
>> the same context. I think that is totally tolerable to finally clean
>> this up.
> 
> ...but more is_zone_device_page() doesn't "finally clean this up".
> Like we discussed above it's the missing locking that's the real
> cleanup, the pfn_to_online_page() internals are secondary.

It's a different cleanup IMHO. We can't do everything in one shot. But
maybe I can drop the is_zone_device_page() parts from this patch and
completely rely on pfn_to_online_page(). Yes, that needs fixing to, but
it's a different story.

The important part of this patch:

While pfn_to_online_page() will always exclude ZONE_DEVICE pages,
checking PG_reserved on ZONE_DEVICE pages (what we do right now!) is
racy as hell (especially when concurrently initializing the memmap).

This does improve the situation.

>>
>> However, I think we also have to clarify if we need the change in 3 at all.
>> It comes from
>>
>> commit f5509cc18daa7f82bcc553be70df2117c8eedc16
>> Author: Kees Cook <keescook@chromium.org>
>> Date:   Tue Jun 7 11:05:33 2016 -0700
>>
>>     mm: Hardened usercopy
>>
>>     This is the start of porting PAX_USERCOPY into the mainline kernel. This
>>     is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
>>     work is based on code by PaX Team and Brad Spengler, and an earlier port
>>     from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.
>> [...]
>>     - otherwise, object must not span page allocations (excepting Reserved
>>       and CMA ranges)
>>
>> Not sure if we really have to care about ZONE_DEVICE at this point.
> 
> That check needs to be careful to ignore ZONE_DEVICE pages. There's
> nothing wrong with a copy spanning ZONE_DEVICE and typical pages.

Please note that the current check would *forbid* this (AFAIKs for a
single heap object). As discussed in the relevant patch, we might be
able to just stop doing that and limit it to real PG_reserved pages
(without ZONE_DEVICE). I'd be happy to not introduce new
is_zone_device_page() users.

Dan Williams Oct. 23, 2019, 7:39 p.m. UTC | #5

On Wed, Oct 23, 2019 at 10:28 AM David Hildenbrand <david@redhat.com> wrote:
>
> >> I dislike this for three reasons
> >>
> >> a) It does not protect against any races, really, it does not improve things.
> >> b) We do have the exact same problem with pfn_to_online_page(). As long as we
> >>    don't hold the memory hotplug lock, memory can get offlined and remove any time. Racy.
> >
> > True, we need to solve that problem too. That seems to want something
> > lighter weight than the hotplug lock that can be held over pfn lookups
> > +  use rather than requiring a page lookup in paths where it's not
> > clear that a page reference would prevent unplug.
> >
> >> c) We mix in ZONE specific stuff into the core. It should be "just another zone"
> >
> > Not sure I grok this when the RFC is sprinkling zone-specific
> > is_zone_device_page() throughout the core?
>
> Most users should not care about the zone. pfn_active() would be enough
> in most situations, especially most PFN walkers - "this memmap is valid
> and e.g., contains a valid zone ...".

Oh, I see, you're saying convert most users to pfn_active() (and some
TBD rcu locking), but only pfn_to_online_page() users would need the
zone lookup? I can get on board with that.

>
> >
> >>
> >> What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87)
> >
> > Sorry I missed this earlier...
> >
> >>
> >> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
> >> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
> >> 3. Introduce pfn_active() that checks against the subsection bitmap
> >> 4. Once the memmap was initialized / prepared, set the subsection active
> >>    (similar to SECTION_IS_ONLINE in the buddy right now)
> >> 5. Before the memmap gets invalidated, set the subsection inactive
> >>    (similar to SECTION_IS_ONLINE in the buddy right now)
> >> 5. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
> >> 6. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE
> >
> > This does not seem to reduce any complexity because it still requires
> > a pfn to zone lookup at the end of the process.
> >
> > I.e. converting pfn_to_online_page() to use a new pfn_active()
> > subsection map plus looking up the zone from pfn_to_page() is more
> > steps than just doing a direct pfn to zone lookup. What am I missing?
>
> That a real "pfn to zone" lookup without going via the struct page will
> require to have more than just a single bitmap. IMHO, keeping the
> information at a single place (memmap) is the clean thing to do (not
> replicating it somewhere else). Going via the memmap might not be as
> fast as a direct lookup, but do we actually care? We are already looking
> at "random PFNs we are not sure if there is a valid memmap".

True, we only care about the validity of the check, and as you pointed
out moving the check to the pfn level does not solve the validity
race. It needs a lock.

>
> >>
> >> Especially, driver-reserved device memory will not get set active in
> >> the subsection bitmap.
> >>
> >> Now to the race. Taking the memory hotplug lock at random places is ugly. I do
> >> wonder if we can use RCU:
> >
> > Ah, yes, exactly what I was thinking above.
> >
> >>
> >> The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page():
> >>
> >>         /* the memmap is guaranteed to remain active under RCU */
> >>         rcu_read_lock();
> >>         if (pfn_active(random_pfn)) {
> >>                 page = pfn_to_page(random_pfn);
> >>                 ... use the page, stays valid
> >>         }
> >>         rcu_unread_lock();
> >>
> >> Memory offlining/memremap code:
> >>
> >>         set_subsections_inactive(pfn, nr_pages); /* clears the bit atomically */
> >>         synchronize_rcu();
> >>         /* all users saw the bitmap update, we can invalide the memmap */
> >>         remove_pfn_range_from_zone(zone, pfn, nr_pages);
> >
> > Looks good to me.
> >
> >>
> >>>
> >>>>
> >>>> I only gave it a quick test with DIMMs on x86-64, but didn't test the
> >>>> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
> >>>> on x86-64 and PPC.
> >>>
> >>> I'll give it a spin, but I don't think the kernel wants to grow more
> >>> is_zone_device_page() users.
> >>
> >> Let's recap. In this RFC, I introduce a total of 4 (!) users only.
> >> The other parts can rely on pfn_to_online_page() only.
> >>
> >> 1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes"
> >> - Basically never used with ZONE_DEVICE.
> >> - We hold a reference!
> >> - All it protects is a SetPageDirty(page);
> >>
> >> 2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
> >> - Same as 1.
> >>
> >> 3. "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes"
> >> - We come via virt_to_head_page() / virt_to_head_page(), not sure about
> >>   references (I assume this should be fine as we don't come via random
> >>   PFNs)
> >> - We check that we don't mix Reserved (including device memory) and CMA
> >>   pages when crossing compound pages.
> >>
> >> I think we can drop 1. and 2., resulting in a total of 2 new users in
> >> the same context. I think that is totally tolerable to finally clean
> >> this up.
> >
> > ...but more is_zone_device_page() doesn't "finally clean this up".
> > Like we discussed above it's the missing locking that's the real
> > cleanup, the pfn_to_online_page() internals are secondary.
>
> It's a different cleanup IMHO. We can't do everything in one shot. But
> maybe I can drop the is_zone_device_page() parts from this patch and
> completely rely on pfn_to_online_page(). Yes, that needs fixing to, but
> it's a different story.
>
> The important part of this patch:
>
> While pfn_to_online_page() will always exclude ZONE_DEVICE pages,
> checking PG_reserved on ZONE_DEVICE pages (what we do right now!) is
> racy as hell (especially when concurrently initializing the memmap).
>
> This does improve the situation.

True that's a race a vector I was not considering.

>
> >>
> >> However, I think we also have to clarify if we need the change in 3 at all.
> >> It comes from
> >>
> >> commit f5509cc18daa7f82bcc553be70df2117c8eedc16
> >> Author: Kees Cook <keescook@chromium.org>
> >> Date:   Tue Jun 7 11:05:33 2016 -0700
> >>
> >>     mm: Hardened usercopy
> >>
> >>     This is the start of porting PAX_USERCOPY into the mainline kernel. This
> >>     is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
> >>     work is based on code by PaX Team and Brad Spengler, and an earlier port
> >>     from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.
> >> [...]
> >>     - otherwise, object must not span page allocations (excepting Reserved
> >>       and CMA ranges)
> >>
> >> Not sure if we really have to care about ZONE_DEVICE at this point.
> >
> > That check needs to be careful to ignore ZONE_DEVICE pages. There's
> > nothing wrong with a copy spanning ZONE_DEVICE and typical pages.
>
> Please note that the current check would *forbid* this (AFAIKs for a
> single heap object). As discussed in the relevant patch, we might be
> able to just stop doing that and limit it to real PG_reserved pages
> (without ZONE_DEVICE). I'd be happy to not introduce new
> is_zone_device_page() users.

At least for non-HMM ZONE_DEVICE usage, i.e. the dax + pmem stuff, is
excluded from this path by:

52f476a323f9 libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

So this case is one more to add to the pile of HMM auditing.

David Hildenbrand Oct. 23, 2019, 9:22 p.m. UTC | #6

On 23.10.19 21:39, Dan Williams wrote:
> On Wed, Oct 23, 2019 at 10:28 AM David Hildenbrand <david@redhat.com> wrote:
>>
>>>> I dislike this for three reasons
>>>>
>>>> a) It does not protect against any races, really, it does not improve things.
>>>> b) We do have the exact same problem with pfn_to_online_page(). As long as we
>>>>    don't hold the memory hotplug lock, memory can get offlined and remove any time. Racy.
>>>
>>> True, we need to solve that problem too. That seems to want something
>>> lighter weight than the hotplug lock that can be held over pfn lookups
>>> +  use rather than requiring a page lookup in paths where it's not
>>> clear that a page reference would prevent unplug.
>>>
>>>> c) We mix in ZONE specific stuff into the core. It should be "just another zone"
>>>
>>> Not sure I grok this when the RFC is sprinkling zone-specific
>>> is_zone_device_page() throughout the core?
>>
>> Most users should not care about the zone. pfn_active() would be enough
>> in most situations, especially most PFN walkers - "this memmap is valid
>> and e.g., contains a valid zone ...".
> 
> Oh, I see, you're saying convert most users to pfn_active() (and some
> TBD rcu locking), but only pfn_to_online_page() users would need the
> zone lookup? I can get on board with that.

I guess my answer to that is simple: If we only care about "is this
memmap safe to touch", use pfn_active()

(well, with pfn_valid_within() similar as done in pfn_to_online_page()
due to memory holes, but these are details - e.g., pfn_active() can
check against pfn_valid_within() right away internally). (+locking TBD
to make sure it remains active)

However, if we want to special case in addition on zones (!ZONE_DEVICE
(a.k.a., onlined via memory blocks, managed by the buddy), ZONE_DEVICE,
whatever might come in the future, ...), also access the zone stored in
the memmap. E.g., by using pfn_to_online_page().

> 
>>
>>>
>>>>
>>>> What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87)
>>>
>>> Sorry I missed this earlier...
>>>
>>>>
>>>> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
>>>> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
>>>> 3. Introduce pfn_active() that checks against the subsection bitmap
>>>> 4. Once the memmap was initialized / prepared, set the subsection active
>>>>    (similar to SECTION_IS_ONLINE in the buddy right now)
>>>> 5. Before the memmap gets invalidated, set the subsection inactive
>>>>    (similar to SECTION_IS_ONLINE in the buddy right now)
>>>> 5. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
>>>> 6. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE
>>>
>>> This does not seem to reduce any complexity because it still requires
>>> a pfn to zone lookup at the end of the process.
>>>
>>> I.e. converting pfn_to_online_page() to use a new pfn_active()
>>> subsection map plus looking up the zone from pfn_to_page() is more
>>> steps than just doing a direct pfn to zone lookup. What am I missing?
>>
>> That a real "pfn to zone" lookup without going via the struct page will
>> require to have more than just a single bitmap. IMHO, keeping the
>> information at a single place (memmap) is the clean thing to do (not
>> replicating it somewhere else). Going via the memmap might not be as
>> fast as a direct lookup, but do we actually care? We are already looking
>> at "random PFNs we are not sure if there is a valid memmap".
> 
> True, we only care about the validity of the check, and as you pointed
> out moving the check to the pfn level does not solve the validity
> race. It needs a lock.

Let's call pfn_active() "a pfn that is active in the system and has an
initialized memmap, which contains sane values" (valid memmap sounds
like pfn_valid(), which is actually "there is a memmap which might
contain garbage"). Yes we need some sort of lightweight locking as
discussed.

[...]

>>>> However, I think we also have to clarify if we need the change in 3 at all.
>>>> It comes from
>>>>
>>>> commit f5509cc18daa7f82bcc553be70df2117c8eedc16
>>>> Author: Kees Cook <keescook@chromium.org>
>>>> Date:   Tue Jun 7 11:05:33 2016 -0700
>>>>
>>>>     mm: Hardened usercopy
>>>>
>>>>     This is the start of porting PAX_USERCOPY into the mainline kernel. This
>>>>     is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
>>>>     work is based on code by PaX Team and Brad Spengler, and an earlier port
>>>>     from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.
>>>> [...]
>>>>     - otherwise, object must not span page allocations (excepting Reserved
>>>>       and CMA ranges)
>>>>
>>>> Not sure if we really have to care about ZONE_DEVICE at this point.
>>>
>>> That check needs to be careful to ignore ZONE_DEVICE pages. There's
>>> nothing wrong with a copy spanning ZONE_DEVICE and typical pages.
>>
>> Please note that the current check would *forbid* this (AFAIKs for a
>> single heap object). As discussed in the relevant patch, we might be
>> able to just stop doing that and limit it to real PG_reserved pages
>> (without ZONE_DEVICE). I'd be happy to not introduce new
>> is_zone_device_page() users.
> 
> At least for non-HMM ZONE_DEVICE usage, i.e. the dax + pmem stuff, is
> excluded from this path by:
> 
> 52f476a323f9 libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

Interesting, and very valuable information. So this sounds like patch #2
can go (or convert it to a documentation update).

> 
> So this case is one more to add to the pile of HMM auditing.

Sounds like HMM is some dangerous piece of software we have. This needs
auditing, fixing, and documentation.

BTW, do you have a good source of details about HMM? Especially about
these oddities you mentioned?

Also, can you have a look at patch #2 7/8 and confirm that doing a
SetPageDirty() on a ZONE_DEVICE page is okay (although not useful)? Thanks!

David Hildenbrand Oct. 24, 2019, 12:50 p.m. UTC | #7

On 23.10.19 09:26, David Hildenbrand wrote:
> On 22.10.19 23:54, Dan Williams wrote:
>> Hi David,
>>
>> Thanks for tackling this!
> 
> Thanks for having a look :)
> 
> [...]
> 
> 
>>> I am probably a little bit too careful (but I don't want to break things).
>>> In most places (besides KVM and vfio that are nuts), the
>>> pfn_to_online_page() check could most probably be avoided by a
>>> is_zone_device_page() check. However, I usually get suspicious when I see
>>> a pfn_valid() check (especially after I learned that people mmap parts of
>>> /dev/mem into user space, including memory without memmaps. Also, people
>>> could memmap offline memory blocks this way :/). As long as this does not
>>> hurt performance, I think we should rather do it the clean way.
>>
>> I'm concerned about using is_zone_device_page() in places that are not
>> known to already have a reference to the page. Here's an audit of
>> current usages, and the ones I think need to cleaned up. The "unsafe"
>> ones do not appear to have any protections against the device page
>> being removed (get_dev_pagemap()). Yes, some of these were added by
>> me. The "unsafe? HMM" ones need HMM eyes because HMM leaks device
>> pages into anonymous memory paths and I'm not up to speed on how it
>> guarantees 'struct page' validity vs device shutdown without using
>> get_dev_pagemap().
>>
>> smaps_pmd_entry(): unsafe
>>
>> put_devmap_managed_page(): safe, page reference is held
>>
>> is_device_private_page(): safe? gpu driver manages private page lifetime
>>
>> is_pci_p2pdma_page(): safe, page reference is held
>>
>> uncharge_page(): unsafe? HMM
>>
>> add_to_kill(): safe, protected by get_dev_pagemap() and dax_lock_page()
>>
>> soft_offline_page(): unsafe
>>
>> remove_migration_pte(): unsafe? HMM
>>
>> move_to_new_page(): unsafe? HMM
>>
>> migrate_vma_pages() and helpers: unsafe? HMM
>>
>> try_to_unmap_one(): unsafe? HMM
>>
>> __put_page(): safe
>>
>> release_pages(): safe
>>
>> I'm hoping all the HMM ones can be converted to
>> is_device_private_page() directlly and have that routine grow a nice
>> comment about how it knows it can always safely de-reference its @page
>> argument.
>>
>> For the rest I'd like to propose that we add a facility to determine
>> ZONE_DEVICE by pfn rather than page. The most straightforward why I
>> can think of would be to just add another bitmap to mem_section_usage
>> to indicate if a subsection is ZONE_DEVICE or not.
> 
> (it's a somewhat unrelated bigger discussion, but we can start discussing it in this thread)
> 
> I dislike this for three reasons
> 
> a) It does not protect against any races, really, it does not improve things.
> b) We do have the exact same problem with pfn_to_online_page(). As long as we
>     don't hold the memory hotplug lock, memory can get offlined and remove any time. Racy.
> c) We mix in ZONE specific stuff into the core. It should be "just another zone"
> 
> What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87)
> 
> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
> 3. Introduce pfn_active() that checks against the subsection bitmap
> 4. Once the memmap was initialized / prepared, set the subsection active
>     (similar to SECTION_IS_ONLINE in the buddy right now)
> 5. Before the memmap gets invalidated, set the subsection inactive
>     (similar to SECTION_IS_ONLINE in the buddy right now)
> 5. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
> 6. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE
> 

Dan, I am suspecting that you want a pfn_to_zone() that will not touch 
the memmap, because it could potentially (altmap) lie on slow memory, right?

A modification might make this possible (but I am not yet sure if we 
want a less generic MM implementation just to fine tune slow memmap 
access here)

1. Keep SECTION_IS_ONLINE as it is with the same semantics
2. Introduce a subsection bitmap to record active ("initialized memmap")
    PFNs. E.g., also set it when setting sections online.
3. Introduce pfn_active() that checks against the subsection bitmap
4. Once the memmap was initialized / prepared, set the subsection active
    (similar to SECTION_IS_ONLINE in the buddy right now)
5. Before the memmap gets invalidated, set the subsection inactive
    (similar to SECTION_IS_ONLINE in the buddy right now)
5. pfn_to_online_page() = pfn_active() && section == SECTION_IS_ONLINE
    (or keep it as is, depends on the RCU locking we eventually
     implement)
6. pfn_to_device_page() = pfn_active() && section != SECTION_IS_ONLINE
7. use pfn_active() whenever we don't care about the zone.

Again, not really a friend of that, it hardcodes ZONE_DEVICE vs. 
!ZONE_DEVICE. When we do a random "pfn_to_page()" (e.g., a pfn walker) 
we really want to touch the memmap right away either way. So we can also 
directly read the zone from it. I really do prefer right now a more 
generic implementation.

[RFC,v1,00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)

Message

Comments