From patchwork Tue Dec 18 04:23:44 2018
X-Patchwork-Submitter: Dan Williams
X-Patchwork-Id: 10734791
Subject: [PATCH v6 4/6] mm: Shuffle initial free memory to improve memory-side-cache utilization
From: Dan Williams
To: akpm@linux-foundation.org
Cc: Michal Hocko, Kees Cook, Dave Hansen, peterz@infradead.org, linux-mm@kvack.org, x86@kernel.org, linux-kernel@vger.kernel.org, mgorman@suse.de
Date: Mon, 17 Dec 2018 20:23:44 -0800
Message-ID: <154510702402.1941238.1616430879354317384.stgit@dwillia2-desk3.amr.corp.intel.com>
In-Reply-To: <154510700291.1941238.817190985966612531.stgit@dwillia2-desk3.amr.corp.intel.com>
References: <154510700291.1941238.817190985966612531.stgit@dwillia2-desk3.amr.corp.intel.com>
User-Agent: StGit/0.18-2-gc94f

Randomization of the page allocator improves the average utilization of
a direct-mapped memory-side-cache. Memory side caching is a platform
capability that Linux has been previously exposed to in HPC
(high-performance computing) environments on specialty platforms. In
that instance it was a smaller pool of high-bandwidth-memory relative to
higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
be found on general purpose server platforms where DRAM is a cache in
front of higher latency persistent memory [1].

Robert offered an explanation of the state of the art of Linux
interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.org/lkml/2017/8/23/195

    Dan's patch series doesn't attempt to ensure buffers won't conflict,
    but also reduces the chance that the buffers will. This will make
    performance more consistent, albeit slower than "optimal" (which
    is near impossible to attain in a general-purpose kernel).
    That's better than forcing users to deploy remedies like:
        "To eliminate this gradual degradation, we have added a Stream
         measurement to the Node Health Check that follows each job;
         nodes are rebooted whenever their measured memory bandwidth
         falls below 300 GB/s."

A replacement for zonesort was merged upstream in commit cc9aec03e58f
"x86/numa_emulation: Introduce uniform split capability". With this
numa_emulation capability, memory can be split into cache sized
("near-memory" sized) numa nodes. A bind operation to such a node, and
disabling workloads on other nodes, enables full cache performance.
However, once the workload exceeds the cache size then cache conflicts
are unavoidable. While HPC environments might be able to tolerate
time-scheduling of cache sized workloads, for general purpose server
platforms, the oversubscribed cache case will be the common case.

The worst case scenario is that a server system owner benchmarks a
workload at boot with an un-contended cache only to see that performance
degrade over time, even below the average cache performance due to
excessive conflicts. Randomization clips the peaks and fills in the
valleys of cache utilization to yield steady average performance.

Here are some performance impact details of the patches:

1/ An Intel internal synthetic memory bandwidth measurement tool saw a
   3X speedup in a contrived case that tries to force cache conflicts.
   The contrived case used the numa_emulation capability to force an
   instance of the benchmark to run in each of two near-memory sized
   numa nodes. If both instances were placed on the same emulated node
   they would fit and cause zero conflicts. On separate emulated nodes
   without randomization, however, they underutilized the cache and
   conflicted unnecessarily due to the in-order allocation per node.

2/ A well known Java server application benchmark was run with a heap
   size that exceeded cache size by 3X.
   The cache conflict rate was 8% for the first run and degraded to 21%
   after page allocator aging. With randomization enabled the rate
   levelled out at 11%.

3/ A MongoDB workload did not observe a measurable difference in
   cache-conflict rates, but the overall throughput dropped by 7% with
   randomization in one case.

4/ Mel Gorman ran his suite of performance workloads with randomization
   enabled on platforms without a memory-side-cache and saw a mix of
   some improvements and some losses [3].

While there is potentially significant improvement for applications that
depend on low latency access across a wide working-set, the performance
impact may be negligible to negative for other workloads. For this
reason the shuffle capability defaults to off unless a direct-mapped
memory-side-cache is detected. Even then, the page_alloc.shuffle=0
parameter can be specified to disable the randomization on those
systems.

Outside of memory-side-cache utilization concerns there is potentially a
security benefit from randomization. Some data exfiltration and
return-oriented-programming attacks rely on the ability to infer the
location of sensitive data objects. The kernel page allocator,
especially early in system boot, has predictable first-in-first-out
behavior for physical pages. Pages are freed in physical address order
when first onlined.

Quoting Kees:
    "While we already have a base-address randomization
     (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
     memory layouts would certainly be using the predictability of
     allocation ordering (i.e. for attacks where the base address isn't
     important: only the relative positions between allocated memory).
     This is common in lots of heap-style attacks. They try to gain
     control over ordering by spraying allocations, etc.

     I'd really like to see this because it gives us something similar
     to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
caches it leaves the vast bulk of memory to be predictably allocated in
order. However, it should be noted, the concrete security benefits are
hard to quantify, and no known CVE is mitigated by this randomization.

Introduce shuffle_free_memory(), and its helper shuffle_zone(), to
perform a Fisher-Yates shuffle of the page allocator 'free_area' lists
when they are initially populated with free memory at boot and at
hotplug time.

The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e.
order 10, or 4MB with 4KB base pages. This trades off randomization
granularity for time spent shuffling. MAX_ORDER-1 was chosen to be
minimally invasive to the page allocator while still showing memory-side
cache behavior improvements, and with the expectation that the security
implications of finer granularity randomization are mitigated by
CONFIG_SLAB_FREELIST_RANDOM.

The performance impact of the shuffling appears to be in the noise
compared to other memory initialization work. Also the bulk of the work
is done in the background as a part of deferred_init_memmap().

This initial randomization can be undone over time so a follow-on patch
is introduced to inject entropy on page free decisions. It is reasonable
to ask if the page free entropy alone is sufficient, but it is not, due
to the in-order initial freeing of pages. At the start of that process
putting page1 in front of or behind page0 still keeps them close
together, and page2 is still near page1 and has a high chance of being
adjacent. As more pages are added ordering diversity improves, but there
is still high page locality for the low address pages and this leads to
no significant impact to the cache conflict rate.

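As a rough illustration of the shuffle and granularity discussed above, here is a userspace sketch, not kernel code. It applies the textbook Fisher-Yates shuffle to an array of block indices that stand in for order-sized free pages; the patch itself picks each swap partner from the whole zone span rather than from a shrinking prefix. The names next_rand(), order_to_bytes(), shuffle_blocks() and is_permutation() are hypothetical, and a fixed-seed xorshift generator is assumed in place of get_random_long(). Like the patch, it does not correct for modulo bias.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for get_random_long(): xorshift64 with a caller seed. */
static uint64_t next_rand(uint64_t *state)
{
	uint64_t x = *state;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	*state = x;
	return x;
}

/* Bytes covered by one free page of the given order. */
static unsigned long order_to_bytes(unsigned int order, unsigned long page_size)
{
	return page_size << order;
}

/*
 * Textbook Fisher-Yates: walk from the last slot down, swapping each slot
 * with a randomly chosen slot at or below it. Modulo bias is ignored.
 */
static void shuffle_blocks(unsigned long *blocks, size_t n, uint64_t seed)
{
	uint64_t state = seed ? seed : 1; /* xorshift must not start at 0 */
	size_t i;

	for (i = n; i > 1; i--) {
		size_t j = next_rand(&state) % i;
		unsigned long tmp = blocks[i - 1];

		blocks[i - 1] = blocks[j];
		blocks[j] = tmp;
	}
}

/* Returns 1 if blocks[] holds each of 0..n-1 exactly once (n <= 64). */
static int is_permutation(const unsigned long *blocks, size_t n)
{
	uint64_t seen = 0;
	size_t i;

	assert(n <= 64);
	for (i = 0; i < n; i++) {
		if (blocks[i] >= n || (seen & (1ULL << blocks[i])))
			return 0;
		seen |= 1ULL << blocks[i];
	}
	return 1;
}
```

With 4KB base pages, order_to_bytes(10, 4096) evaluates to 4194304, which is the 4MB granularity quoted above. A lower order gives finer randomization at the cost of more entries to shuffle.
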
[1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
[2]: https://lkml.org/lkml/2018/9/22/54
[3]: https://lkml.org/lkml/2018/10/12/309

Cc: Michal Hocko
Cc: Kees Cook
Cc: Dave Hansen
Signed-off-by: Dan Williams
---
 include/linux/list.h    |   17 ++++
 include/linux/mmzone.h  |    4 +
 include/linux/shuffle.h |   47 ++++++++++
 init/Kconfig            |   36 ++++++++
 mm/Makefile             |    7 +-
 mm/memblock.c           |   16 +++
 mm/memory_hotplug.c     |    3 +
 mm/page_alloc.c         |    3 +
 mm/shuffle.c            |  215 +++++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 346 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/shuffle.h
 create mode 100644 mm/shuffle.c

diff --git a/include/linux/list.h b/include/linux/list.h
index edb7628e46ed..3dfb8953f241 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -150,6 +150,23 @@ static inline void list_replace_init(struct list_head *old,
 	INIT_LIST_HEAD(old);
 }
 
+/**
+ * list_swap - replace entry1 with entry2 and re-add entry1 at entry2's position
+ * @entry1: the location to place entry2
+ * @entry2: the location to place entry1
+ */
+static inline void list_swap(struct list_head *entry1,
+			     struct list_head *entry2)
+{
+	struct list_head *pos = entry2->prev;
+
+	list_del(entry2);
+	list_replace(entry1, entry2);
+	if (pos == entry1)
+		pos = entry2;
+	list_add(entry1, pos);
+}
+
 /**
  * list_del_init - deletes entry from list and reinitialize it.
  * @entry: the element to delete from the list.
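To see why list_swap() needs the `pos == entry1` check, the following userspace sketch re-implements just enough of a circular doubly linked list to exercise it. This is an assumption-laden model and not the kernel's <linux/list.h>: list_init(), struct node and demo_swap() are hypothetical names, and list_del() here skips the pointer poisoning the kernel version performs.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
	struct list_head *next = head->next;

	next->prev = entry;
	entry->next = next;
	entry->prev = head;
	head->next = entry;
}

/* Unlink entry; its own pointers are left stale (no poisoning here). */
static void list_del(struct list_head *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
}

/* Put fresh into the slot currently occupied by old. */
static void list_replace(struct list_head *old, struct list_head *fresh)
{
	fresh->next = old->next;
	fresh->next->prev = fresh;
	fresh->prev = old->prev;
	fresh->prev->next = fresh;
}

/* Same steps as the list_swap() added by the patch. */
static void list_swap(struct list_head *entry1, struct list_head *entry2)
{
	struct list_head *pos = entry2->prev;

	list_del(entry2);
	list_replace(entry1, entry2);
	if (pos == entry1)	/* adjacent case: entry1 was just replaced */
		pos = entry2;
	list_add(entry1, pos);
}

struct node {
	struct list_head link;	/* first member, so a cast recovers the node */
	int id;
};

/* Hypothetical helper: build 0,1,2, swap two nodes, report the new order. */
static void demo_swap(int first, int second, int out[3])
{
	struct list_head head;
	struct node nodes[3];
	struct list_head *pos;
	int i;

	assert(first != second && first >= 0 && first < 3 &&
	       second >= 0 && second < 3);

	list_init(&head);
	for (i = 0; i < 3; i++) {
		nodes[i].id = i;
		list_add(&nodes[i].link, head.prev);	/* append at the tail */
	}

	list_swap(&nodes[first].link, &nodes[second].link);

	i = 0;
	for (pos = head.next; pos != &head; pos = pos->next)
		out[i++] = ((struct node *)pos)->id;
}
```

When entry1 immediately precedes entry2, the saved `pos` is entry1 itself, which has already been replaced in the list, so re-adding after it would corrupt the links. Redirecting `pos` to entry2 covers that adjacent case.
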
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 847705a6d0ec..eafa66d66232 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1266,6 +1266,10 @@ void sparse_init(void);
 #else
 #define sparse_init()	do {} while (0)
 #define sparse_index_init(_sec, _nid)  do {} while (0)
+static inline int pfn_present(unsigned long pfn)
+{
+	return 1;
+}
 #endif /* CONFIG_SPARSEMEM */
 
 /*
diff --git a/include/linux/shuffle.h b/include/linux/shuffle.h
new file mode 100644
index 000000000000..a8a168919cb5
--- /dev/null
+++ b/include/linux/shuffle.h
@@ -0,0 +1,47 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2018 Intel Corporation. All rights reserved.
+#ifndef _MM_SHUFFLE_H
+#define _MM_SHUFFLE_H
+
+enum mm_shuffle_ctl {
+	SHUFFLE_ENABLE,
+	SHUFFLE_FORCE_DISABLE,
+};
+#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
+DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
+extern void page_alloc_shuffle(enum mm_shuffle_ctl);
+extern void __shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
+		unsigned long end_pfn);
+static inline void shuffle_free_memory(pg_data_t *pgdat,
+		unsigned long start_pfn, unsigned long end_pfn)
+{
+	if (!static_branch_unlikely(&page_alloc_shuffle_key))
+		return;
+	__shuffle_free_memory(pgdat, start_pfn, end_pfn);
+}
+
+extern void __shuffle_zone(struct zone *z, unsigned long start_pfn,
+		unsigned long end_pfn);
+static inline void shuffle_zone(struct zone *z, unsigned long start_pfn,
+		unsigned long end_pfn)
+{
+	if (!static_branch_unlikely(&page_alloc_shuffle_key))
+		return;
+	__shuffle_zone(z, start_pfn, end_pfn);
+}
+#else
+static inline void shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
+		unsigned long end_pfn)
+{
+}
+
+static inline void shuffle_zone(struct zone *z, unsigned long start_pfn,
+		unsigned long end_pfn)
+{
+}
+
+static inline void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
+{
+}
+#endif
+#endif /* _MM_SHUFFLE_H */
diff --git a/init/Kconfig b/init/Kconfig
index cf5b5a0dcbc2..fa6812d995ec 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1720,6 +1720,42 @@ config SLAB_FREELIST_HARDENED
 	  sacrifies to harden the kernel slab allocator against common
 	  freelist exploit methods.
 
+config SHUFFLE_PAGE_ALLOCATOR
+	bool "Page allocator randomization"
+	depends on HAVE_MEMBLOCK_CACHE_INFO
+	default SLAB_FREELIST_RANDOM
+	help
+	  Randomization of the page allocator improves the average
+	  utilization of a direct-mapped memory-side-cache. See section
+	  5.2.27 Heterogeneous Memory Attribute Table (HMAT) in the ACPI
+	  6.2a specification for an example of how a platform advertises
+	  the presence of a memory-side-cache. There are also incidental
+	  security benefits as it reduces the predictability of page
+	  allocations to complement SLAB_FREELIST_RANDOM, but the
+	  default granularity of shuffling on 4MB (MAX_ORDER-1) pages is
+	  selected based on cache utilization benefits.
+
+	  While the randomization improves cache utilization it may
+	  negatively impact workloads on platforms without a cache. For
+	  this reason, by default, the randomization is enabled only
+	  after runtime detection of a direct-mapped memory-side-cache.
+	  Otherwise, the randomization may be force enabled with the
+	  'page_alloc.shuffle' kernel command line parameter.
+
+	  Say Y if unsure.
+
+config SHUFFLE_PAGE_ORDER
+	depends on SHUFFLE_PAGE_ALLOCATOR
+	int "Page allocator shuffle order"
+	range 0 10
+	default 10
+	help
+	  Specify the granularity at which shuffling (randomization) is
+	  performed. By default this is set to MAX_ORDER-1 to minimize
+	  runtime impact of randomization and with the expectation that
+	  SLAB_FREELIST_RANDOM mitigates heap attacks on smaller
+	  object granularities.
+
 config SLUB_CPU_PARTIAL
 	default y
 	depends on SLUB && SMP
diff --git a/mm/Makefile b/mm/Makefile
index d210cc9d6f80..ac5e5ba78874 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,7 +33,7 @@ mmu-$(CONFIG_MMU)	+= process_vm_access.o
 endif
 
 obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
-			   maccess.o page_alloc.o page-writeback.o \
+			   maccess.o page-writeback.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
@@ -41,6 +41,11 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   interval_tree.o list_lru.o workingset.o \
 			   debug.o $(mmu-y)
 
+# Give 'page_alloc' its own module-parameter namespace
+page-alloc-y := page_alloc.o
+page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
+
+obj-y += page-alloc.o
 obj-y += init-mm.o
 obj-y += memblock.o
 
diff --git a/mm/memblock.c b/mm/memblock.c
index 8ebbc77f20c5..e51ecd6c1308 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -850,6 +851,12 @@ int __init_memblock memblock_set_sidecache(phys_addr_t base, phys_addr_t size,
 
 		r->cache_size = cache_size;
 		r->direct_mapped = direct_mapped;
+		/*
+		 * Enable randomization for amortizing direct-mapped
+		 * memory-side-cache conflicts.
+		 */
+		if (r->size > r->cache_size && r->direct_mapped)
+			page_alloc_shuffle(SHUFFLE_ENABLE);
 	}
 
 	return 0;
@@ -1971,9 +1978,16 @@ static unsigned long __init free_low_memory_core_early(void)
 	 *  low ram will be on Node1
 	 */
 	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
-				NULL)
+				NULL) {
+		pg_data_t *pgdat;
+
 		count += __free_memory_core(start, end);
 
+		for_each_online_pgdat(pgdat)
+			shuffle_free_memory(pgdat, PHYS_PFN(start),
+					PHYS_PFN(end));
+	}
+
 	return count;
 }
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2b2b3ccbbfb5..697669ffce32 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -23,6 +23,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -895,6 +896,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	zone->zone_pgdat->node_present_pages += onlined_pages;
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 
+	shuffle_zone(zone, pfn, zone_end_pfn(zone));
+
 	if (onlined_pages) {
 		node_states_set_node(nid, &arg);
 		if (need_zonelists_rebuild)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2ec9cc407216..eaa9a012d6ae 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -60,6 +60,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -1595,6 +1596,8 @@ static int __init deferred_init_memmap(void *data)
 	}
 	pgdat_resize_unlock(pgdat, &flags);
 
+	shuffle_zone(zone, first_init_pfn, zone_end_pfn(zone));
+
 	/* Sanity check that the next zone really is unpopulated */
 	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
 
diff --git a/mm/shuffle.c b/mm/shuffle.c
new file mode 100644
index 000000000000..07961ff41a03
--- /dev/null
+++ b/mm/shuffle.c
@@ -0,0 +1,215 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2018 Intel Corporation. All rights reserved.
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include "internal.h"
+
+DEFINE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
+static unsigned long shuffle_state;
+
+/*
+ * Depending on the architecture, module parameter parsing may run
+ * before, or after the cache detection. SHUFFLE_FORCE_DISABLE prevents,
+ * or reverts the enabling of the shuffle implementation. SHUFFLE_ENABLE
+ * attempts to turn on the implementation, but aborts if it finds
+ * SHUFFLE_FORCE_DISABLE already set.
+ */
+void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
+{
+	if (ctl == SHUFFLE_FORCE_DISABLE)
+		set_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state);
+
+	if (test_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state)) {
+		if (test_and_clear_bit(SHUFFLE_ENABLE, &shuffle_state))
+			static_branch_disable(&page_alloc_shuffle_key);
+	} else if (ctl == SHUFFLE_ENABLE
+			&& !test_and_set_bit(SHUFFLE_ENABLE, &shuffle_state))
+		static_branch_enable(&page_alloc_shuffle_key);
+}
+
+static bool shuffle_param;
+static int shuffle_show(char *buffer, const struct kernel_param *kp)
+{
+	return sprintf(buffer, "%c\n", test_bit(SHUFFLE_ENABLE, &shuffle_state)
+			? 'Y' : 'N');
+}
+static int shuffle_store(const char *val, const struct kernel_param *kp)
+{
+	int rc = param_set_bool(val, kp);
+
+	if (rc < 0)
+		return rc;
+	if (shuffle_param)
+		page_alloc_shuffle(SHUFFLE_ENABLE);
+	else
+		page_alloc_shuffle(SHUFFLE_FORCE_DISABLE);
+	return 0;
+}
+module_param_call(shuffle, shuffle_store, shuffle_show, &shuffle_param, 0400);
+
+/*
+ * For two pages to be swapped in the shuffle, they must be free (on a
+ * 'free_area' lru), have the same order, and have the same migratetype.
+ */
+static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
+{
+	struct page *page;
+
+	/*
+	 * Given we're dealing with randomly selected pfns in a zone we
+	 * need to ask questions like...
+	 */
+
+	/* ...is the pfn even in the memmap? */
+	if (!pfn_valid_within(pfn))
+		return NULL;
+
+	/* ...is the pfn in a present section or a hole? */
+	if (!pfn_present(pfn))
+		return NULL;
+
+	/* ...is the page free and currently on a free_area list? */
+	page = pfn_to_page(pfn);
+	if (!PageBuddy(page))
+		return NULL;
+
+	/*
+	 * ...is the page on the same list as the page we will
+	 * shuffle it with?
+	 */
+	if (page_order(page) != order)
+		return NULL;
+
+	return page;
+}
+
+/*
+ * Fisher-Yates shuffle the freelist which prescribes iterating through
+ * an array, pfns in this case, and randomly swapping each entry with
+ * another in the span, end_pfn - start_pfn.
+ *
+ * To keep the implementation simple it does not attempt to correct for
+ * sources of bias in the distribution, like modulo bias or
+ * pseudo-random number generator bias. I.e. the expectation is that
+ * this shuffling raises the bar for attacks that exploit the
+ * predictability of page allocations, but need not be a perfect
+ * shuffle.
+ *
+ * Note that we don't use @z->zone_start_pfn and zone_end_pfn(@z)
+ * directly since the caller may be aware of holes in the zone and can
+ * improve the accuracy of the random pfn selection.
+ */
+#define SHUFFLE_RETRY 10
+static void __meminit shuffle_zone_order(struct zone *z, unsigned long start_pfn,
+		unsigned long end_pfn, const int order)
+{
+	unsigned long i, flags;
+	const int order_pages = 1 << order;
+
+	if (start_pfn < z->zone_start_pfn)
+		start_pfn = z->zone_start_pfn;
+	if (end_pfn > zone_end_pfn(z))
+		end_pfn = zone_end_pfn(z);
+
+	/* probably means that start/end were outside the zone */
+	if (end_pfn <= start_pfn)
+		return;
+	spin_lock_irqsave(&z->lock, flags);
+	start_pfn = ALIGN(start_pfn, order_pages);
+	for (i = start_pfn; i < end_pfn; i += order_pages) {
+		unsigned long j;
+		int migratetype, retry;
+		struct page *page_i, *page_j;
+
+		/*
+		 * We expect page_i, in the sub-range of a zone being
+		 * added (@start_pfn to @end_pfn), to more likely be
+		 * valid compared to page_j randomly selected in the
+		 * span @zone_start_pfn to @spanned_pages.
+		 */
+		page_i = shuffle_valid_page(i, order);
+		if (!page_i)
+			continue;
+
+		for (retry = 0; retry < SHUFFLE_RETRY; retry++) {
+			/*
+			 * Pick a random order aligned page from the
+			 * start of the zone. Use the *whole* zone here
+			 * so that if it is freed in tiny pieces that we
+			 * randomize in the whole zone, not just within
+			 * those fragments.
+			 *
+			 * Since page_j comes from a potentially sparse
+			 * address range we want to try a bit harder to
+			 * find a shuffle point for page_i.
+			 */
+			j = z->zone_start_pfn +
+				ALIGN_DOWN(get_random_long() % z->spanned_pages,
+						order_pages);
+			page_j = shuffle_valid_page(j, order);
+			if (page_j && page_j != page_i)
+				break;
+		}
+		if (retry >= SHUFFLE_RETRY) {
+			pr_debug("%s: failed to swap %#lx\n", __func__, i);
+			continue;
+		}
+
+		/*
+		 * Each migratetype corresponds to its own list, make
+		 * sure the types match otherwise we're moving pages to
+		 * lists where they do not belong.
+		 */
+		migratetype = get_pageblock_migratetype(page_i);
+		if (get_pageblock_migratetype(page_j) != migratetype) {
+			pr_debug("%s: migratetype mismatch %#lx\n", __func__, i);
+			continue;
+		}
+
+		list_swap(&page_i->lru, &page_j->lru);
+
+		pr_debug("%s: swap: %#lx -> %#lx\n", __func__, i, j);
+
+		/* take it easy on the zone lock */
+		if ((i % (100 * order_pages)) == 0) {
+			spin_unlock_irqrestore(&z->lock, flags);
+			cond_resched();
+			spin_lock_irqsave(&z->lock, flags);
+		}
+	}
+	spin_unlock_irqrestore(&z->lock, flags);
+}
+
+void __meminit __shuffle_zone(struct zone *z, unsigned long start_pfn,
+		unsigned long end_pfn)
+{
+	int i;
+
+	/* shuffle all the orders at the specified order and higher */
+	for (i = CONFIG_SHUFFLE_PAGE_ORDER; i < MAX_ORDER; i++)
+		shuffle_zone_order(z, start_pfn, end_pfn, i);
+}
+
+/**
+ * shuffle_free_memory - reduce the predictability of the page allocator
+ * @pgdat: node page data
+ * @start_pfn: Limit the shuffle to the greater of this value or zone start
+ * @end_pfn: Limit the shuffle to the lesser of this value or zone end
+ *
+ * While shuffle_zone() attempts to avoid holes with pfn_valid() and
+ * pfn_present() they can not report sub-section sized holes. @start_pfn
+ * and @end_pfn limit the shuffle to the exact memory pages being freed.
+ */
+void __meminit __shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
+		unsigned long end_pfn)
+{
+	struct zone *z;
+
+	for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
+		shuffle_zone(z, start_pfn, end_pfn);
+}
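
The comment above page_alloc_shuffle() says the outcome should not depend on whether parameter parsing or cache detection runs first. A small userspace model can make that concrete. This is only a sketch under stated assumptions: struct shuffle_model, model_ctl() and model_run() are hypothetical names, a plain bool stands in for the static key, and ordinary bit operations stand in for the atomic set_bit()/test_and_set_bit() helpers.

```c
#include <assert.h>
#include <stdbool.h>

enum mm_shuffle_ctl {
	SHUFFLE_ENABLE,
	SHUFFLE_FORCE_DISABLE,
};

struct shuffle_model {
	unsigned long state;	/* one bit per mm_shuffle_ctl value */
	bool key;		/* stands in for page_alloc_shuffle_key */
};

/* Mirrors the control flow of page_alloc_shuffle(), without atomics. */
static void model_ctl(struct shuffle_model *m, enum mm_shuffle_ctl ctl)
{
	const unsigned long enable = 1UL << SHUFFLE_ENABLE;
	const unsigned long force_off = 1UL << SHUFFLE_FORCE_DISABLE;

	if (ctl == SHUFFLE_FORCE_DISABLE)
		m->state |= force_off;

	if (m->state & force_off) {
		/* A forced disable reverts any earlier enable. */
		if (m->state & enable) {
			m->state &= ~enable;
			m->key = false;
		}
	} else if (ctl == SHUFFLE_ENABLE && !(m->state & enable)) {
		m->state |= enable;
		m->key = true;
	}
}

/* Apply two requests in order and report whether shuffling ends up on. */
static bool model_run(enum mm_shuffle_ctl first, enum mm_shuffle_ctl second)
{
	struct shuffle_model m = { 0, false };

	model_ctl(&m, first);
	model_ctl(&m, second);
	return m.key;
}
```

In this model a force-disable wins regardless of ordering: enabling and then disabling turns the key off again, and disabling first makes a later enable a no-op.
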