From patchwork Fri Feb 1 05:15:17 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dan Williams X-Patchwork-Id: 10791799 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 5A5A013B4 for ; Fri, 1 Feb 2019 05:28:01 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 3D4213190A for ; Fri, 1 Feb 2019 05:28:01 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 2D1F33190C; Fri, 1 Feb 2019 05:28:01 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1 Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6D9223190A for ; Fri, 1 Feb 2019 05:27:59 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 6CC488E0003; Fri, 1 Feb 2019 00:27:58 -0500 (EST) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 67B048E0001; Fri, 1 Feb 2019 00:27:58 -0500 (EST) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 592428E0003; Fri, 1 Feb 2019 00:27:58 -0500 (EST) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from mail-pl1-f200.google.com (mail-pl1-f200.google.com [209.85.214.200]) by kanga.kvack.org (Postfix) with ESMTP id 076C88E0001 for ; Fri, 1 Feb 2019 00:27:58 -0500 (EST) Received: by mail-pl1-f200.google.com with SMTP id a9so4265594pla.2 for ; Thu, 31 Jan 2019 21:27:57 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-original-authentication-results:x-gm-message-state:subject:from :to:cc:date:message-id:in-reply-to:references:user-agent :mime-version:content-transfer-encoding; bh=An7uY+A6BJcSgNwEq02Ib5TeQLQgYjhYLH9pddBtHjQ=; b=WbogqsGyYIydFq4IqoiHIj0ruECtCXuN/cWOi8vOzNufQ5OBC3/sbyHNTS9hdzkdkQ Sl2B/uxQ9sBXU1w1M2pnes3+4O+PCKILNlEwi2GwXg0aSW7sSIjLYppNIUBXjnwg5mEa 4uL9GJ3BukiW3kp7ZXOQxEAZ7lKaQgvFzhN96Fex5IMxP1YVOeXzHSUbMBjf0Jj9ndDg kf389uz9SC+9ctN8wIdBKF607nkAqto9oxA3ER7Idj/o8brauIgTPpZdjspkm8G/jPcM UWxF/ioeIgTV2EyPMBWfis3oHqiq72PHrCe2BGFKMiJYiIdQSsKxlTX62s0XvgC9579/ 75jQ== X-Original-Authentication-Results: mx.google.com; spf=pass (google.com: domain of dan.j.williams@intel.com designates 134.134.136.65 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com X-Gm-Message-State: AHQUAuYdQzRROs5dqaTvh06vHsgWcOXh965KarfxR6bTflbvugD30J0v zjf7vAagJ4y6UEczwGPDVoQPy8GhyT6HjBpXaHNkq8us/IVbBT2g54If7LmZ+iEFaxXrgFDLRXR 5rfzh5Gu2LuqurLNqsIfDi6F90t5uIN8qg8XkrXR7JEQly9Cn+pUf0HrexmJMXwVaBA== X-Received: by 2002:a63:fd0a:: with SMTP id d10mr946894pgh.164.1548998877584; Thu, 31 Jan 2019 21:27:57 -0800 (PST) X-Google-Smtp-Source: AHgI3IZ5uXjVwdOhPfwmj67lmgg/aHrhawoCGpuGJ+DIjEvBgJj/iMT0ER8euTFFvZPrLFLlj6Hc X-Received: by 2002:a63:fd0a:: with SMTP id d10mr946833pgh.164.1548998875502; Thu, 31 Jan 2019 21:27:55 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1548998875; cv=none; d=google.com; s=arc-20160816; b=X70SMPDN6cfUfpnrsxsxHFuego/3cZZ5lIH/07mtcDRDUybOPNblXk21f9mSPpNNrG mzJ5j2r0h3M2CSvoGhoOOpJhHcSuvR3SrpNRrsqqk1CqAEj3Hpl9+jOMD7x0VobGYEbv N9NFqOdJziqgQqBuEIuAPb0WYYaPLgH7mrDdiXnaionVigelEoB662YjmrVpbZiOVBFT Uc6/IQw97rDHpt/ksNHvdPEGq62WPoAUpV+2cXRY85KQMkGK0DMTDP2QeLe7npJpaHZE 3oCWApRsR5RRlAbMTjr1WCNKCDJW/wLKf1vbq4TUimKuyvuI2YakIvcecFtUKPBUDMXY w+fg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:mime-version:user-agent:references :in-reply-to:message-id:date:cc:to:from:subject; bh=An7uY+A6BJcSgNwEq02Ib5TeQLQgYjhYLH9pddBtHjQ=; b=tUlDjVYdboFik1pURMQirGF1329GS+uWwEx4ns6weDJzNImxMt7SuUSHS4lttfmbiF 7HggEk2nkd0msF/dPrbWMr0e+P1d9GrZYrgFGIKFkkXoq/oYEPr35I4Cxm0Iz/YqX/nK uD1L72WEyaAu4vP6aTS1tN7426EIlFIQXUSMK9kl70Uj7XnYewxjOTQPNxXwMl70RNny vfBeQSEftfS0mPtgctKBqumOzdgmZsiCHFusuP62P2RsMt8D0li4Bzv8zkonjii/D6lI Wbl30FGm4OvLp40IGmaxAUIhR69uFDk/XQZq7O1Y7ymgC9PuJv05kf2Szruc4AgU3Joc w8Eg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of dan.j.williams@intel.com designates 134.134.136.65 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from mga03.intel.com (mga03.intel.com. [134.134.136.65]) by mx.google.com with ESMTPS id b14si1376437plk.333.2019.01.31.21.27.55 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 31 Jan 2019 21:27:55 -0800 (PST) Received-SPF: pass (google.com: domain of dan.j.williams@intel.com designates 134.134.136.65 as permitted sender) client-ip=134.134.136.65; Authentication-Results: mx.google.com; spf=pass (google.com: domain of dan.j.williams@intel.com designates 134.134.136.65 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga002.jf.intel.com ([10.7.209.21]) by orsmga103.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 31 Jan 2019 21:27:54 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.56,547,1539673200"; d="scan'208";a="130657204" Received: from dwillia2-desk3.jf.intel.com (HELO dwillia2-desk3.amr.corp.intel.com) ([10.54.39.16]) by orsmga002.jf.intel.com with ESMTP; 31 Jan 2019 21:27:54 -0800 Subject: [PATCH v10 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization From: Dan Williams To: akpm@linux-foundation.org Cc: Dave Hansen , Michal Hocko , Kees Cook , linux-mm@kvack.org, linux-kernel@vger.kernel.org, keith.busch@intel.com Date: Thu, 31 Jan 2019 21:15:17 -0800 Message-ID: <154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com> In-Reply-To: <154899811208.3165233.17623209031065121886.stgit@dwillia2-desk3.amr.corp.intel.com> References: <154899811208.3165233.17623209031065121886.stgit@dwillia2-desk3.amr.corp.intel.com> User-Agent: StGit/0.18-2-gc94f MIME-Version: 1.0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: X-Virus-Scanned: ClamAV using ClamSMTP Randomization of the page allocator improves the average utilization of a direct-mapped memory-side-cache. Memory side caching is a platform capability that Linux has been previously exposed to in HPC (high-performance computing) environments on specialty platforms. In that instance it was a smaller pool of high-bandwidth-memory relative to higher-capacity / lower-bandwidth DRAM. Now, this capability is going to be found on general purpose server platforms where DRAM is a cache in front of higher latency persistent memory [1]. Robert offered an explanation of the state of the art of Linux interactions with memory-side-caches [2], and I copy it here: It's been a problem in the HPC space: http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/ A kernel module called zonesort is available to try to help: https://software.intel.com/en-us/articles/xeon-phi-software and this abandoned patch series proposed that for the kernel: https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com Dan's patch series doesn't attempt to ensure buffers won't conflict, but also reduces the chance that the buffers will. This will make performance more consistent, albeit slower than "optimal" (which is near impossible to attain in a general-purpose kernel). That's better than forcing users to deploy remedies like: "To eliminate this gradual degradation, we have added a Stream measurement to the Node Health Check that follows each job; nodes are rebooted whenever their measured memory bandwidth falls below 300 GB/s." A replacement for zonesort was merged upstream in commit cc9aec03e58f "x86/numa_emulation: Introduce uniform split capability". With this numa_emulation capability, memory can be split into cache sized ("near-memory" sized) numa nodes. A bind operation to such a node, and disabling workloads on other nodes, enables full cache performance. However, once the workload exceeds the cache size then cache conflicts are unavoidable. While HPC environments might be able to tolerate time-scheduling of cache sized workloads, for general purpose server platforms, the oversubscribed cache case will be the common case. The worst case scenario is that a server system owner benchmarks a workload at boot with an un-contended cache only to see that performance degrade over time, even below the average cache performance due to excessive conflicts. Randomization clips the peaks and fills in the valleys of cache utilization to yield steady average performance. Here are some performance impact details of the patches: 1/ An Intel internal synthetic memory bandwidth measurement tool, saw a 3X speedup in a contrived case that tries to force cache conflicts. The contrived cased used the numa_emulation capability to force an instance of the benchmark to be run in two of the near-memory sized numa nodes. If both instances were placed on the same emulated they would fit and cause zero conflicts. While on separate emulated nodes without randomization they underutilized the cache and conflicted unnecessarily due to the in-order allocation per node. 2/ A well known Java server application benchmark was run with a heap size that exceeded cache size by 3X. The cache conflict rate was 8% for the first run and degraded to 21% after page allocator aging. With randomization enabled the rate levelled out at 11%. 3/ A MongoDB workload did not observe measurable difference in cache-conflict rates, but the overall throughput dropped by 7% with randomization in one case. 4/ Mel Gorman ran his suite of performance workloads with randomization enabled on platforms without a memory-side-cache and saw a mix of some improvements and some losses [3]. While there is potentially significant improvement for applications that depend on low latency access across a wide working-set, the performance may be negligible to negative for other workloads. For this reason the shuffle capability defaults to off unless a direct-mapped memory-side-cache is detected. Even then, the page_alloc.shuffle=0 parameter can be specified to disable the randomization on those systems. Outside of memory-side-cache utilization concerns there is potentially security benefit from randomization. Some data exfiltration and return-oriented-programming attacks rely on the ability to infer the location of sensitive data objects. The kernel page allocator, especially early in system boot, has predictable first-in-first out behavior for physical pages. Pages are freed in physical address order when first onlined. Quoting Kees: "While we already have a base-address randomization (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and memory layouts would certainly be using the predictability of allocation ordering (i.e. for attacks where the base address isn't important: only the relative positions between allocated memory). This is common in lots of heap-style attacks. They try to gain control over ordering by spraying allocations, etc. I'd really like to see this because it gives us something similar to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator." While SLAB_FREELIST_RANDOM reduces the predictability of some local slab caches it leaves vast bulk of memory to be predictably in order allocated. However, it should be noted, the concrete security benefits are hard to quantify, and no known CVE is mitigated by this randomization. Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform a Fisher-Yates shuffle of the page allocator 'free_area' lists when they are initially populated with free memory at boot and at hotplug time. Do this based on either the presence of a page_alloc.shuffle=Y command line parameter, or autodetection of a memory-side-cache (to be added in a follow-on patch). The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1 i.e. 10, 4MB this trades off randomization granularity for time spent shuffling. MAX_ORDER-1 was chosen to be minimally invasive to the page allocator while still showing memory-side cache behavior improvements, and the expectation that the security implications of finer granularity randomization is mitigated by CONFIG_SLAB_FREELIST_RANDOM. The performance impact of the shuffling appears to be in the noise compared to other memory initialization work. This initial randomization can be undone over time so a follow-on patch is introduced to inject entropy on page free decisions. It is reasonable to ask if the page free entropy is sufficient, but it is not enough due to the in-order initial freeing of pages. At the start of that process putting page1 in front or behind page0 still keeps them close together, page2 is still near page1 and has a high chance of being adjacent. As more pages are added ordering diversity improves, but there is still high page locality for the low address pages and this leads to no significant impact to the cache conflict rate. [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/ [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM [3]: https://lkml.org/lkml/2018/10/12/309 Cc: Dave Hansen Acked-by: Michal Hocko Reviewed-by: Kees Cook Signed-off-by: Dan Williams --- Documentation/admin-guide/kernel-parameters.txt | 10 + include/linux/list.h | 17 ++ include/linux/mmzone.h | 1 init/Kconfig | 23 +++ mm/Makefile | 7 + mm/memory_hotplug.c | 3 mm/page_alloc.c | 6 + mm/shuffle.c | 170 +++++++++++++++++++++++ mm/shuffle.h | 52 +++++++ 9 files changed, 287 insertions(+), 2 deletions(-) create mode 100644 mm/shuffle.c create mode 100644 mm/shuffle.h diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index b799bcf67d7b..abfd4a325354 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3080,6 +3080,16 @@ This will also cause panics on machine check exceptions. Useful together with panic=30 to trigger a reboot. + page_alloc.shuffle= + [KNL] Boolean flag to control whether the page allocator + should randomize its free lists. The randomization may + be automatically enabled if the kernel detects it is + running on a platform with a direct-mapped memory-side + cache, and this parameter can be used to + override/disable that behavior. The state of the flag + can be read from sysfs at: + /sys/module/page_alloc/parameters/shuffle. + page_owner= [KNL] Boot-time page_owner enabling option. Storage of the information about who allocated each page is disabled in default. With this switch, diff --git a/include/linux/list.h b/include/linux/list.h index edb7628e46ed..3dfb8953f241 100644 --- a/include/linux/list.h +++ b/include/linux/list.h @@ -150,6 +150,23 @@ static inline void list_replace_init(struct list_head *old, INIT_LIST_HEAD(old); } +/** + * list_swap - replace entry1 with entry2 and re-add entry1 at entry2's position + * @entry1: the location to place entry2 + * @entry2: the location to place entry1 + */ +static inline void list_swap(struct list_head *entry1, + struct list_head *entry2) +{ + struct list_head *pos = entry2->prev; + + list_del(entry2); + list_replace(entry1, entry2); + if (pos == entry1) + pos = entry2; + list_add(entry1, pos); +} + /** * list_del_init - deletes entry from list and reinitialize it. * @entry: the element to delete from the list. diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index cc4a507d7ca4..c151c87a728a 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1272,6 +1272,7 @@ void sparse_init(void); #else #define sparse_init() do {} while (0) #define sparse_index_init(_sec, _nid) do {} while (0) +#define pfn_present pfn_valid #endif /* CONFIG_SPARSEMEM */ /* diff --git a/init/Kconfig b/init/Kconfig index d47cb77a220e..cfa199f3e9be 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1714,6 +1714,29 @@ config SLAB_FREELIST_HARDENED sacrifies to harden the kernel slab allocator against common freelist exploit methods. +config SHUFFLE_PAGE_ALLOCATOR + bool "Page allocator randomization" + default SLAB_FREELIST_RANDOM && ACPI_NUMA + help + Randomization of the page allocator improves the average + utilization of a direct-mapped memory-side-cache. See section + 5.2.27 Heterogeneous Memory Attribute Table (HMAT) in the ACPI + 6.2a specification for an example of how a platform advertises + the presence of a memory-side-cache. There are also incidental + security benefits as it reduces the predictability of page + allocations to compliment SLAB_FREELIST_RANDOM, but the + default granularity of shuffling on 4MB (MAX_ORDER) pages is + selected based on cache utilization benefits. + + While the randomization improves cache utilization it may + negatively impact workloads on platforms without a cache. For + this reason, by default, the randomization is enabled only + after runtime detection of a direct-mapped memory-side-cache. + Otherwise, the randomization may be force enabled with the + 'page_alloc.shuffle' kernel command line parameter. + + Say Y if unsure. + config SLUB_CPU_PARTIAL default y depends on SLUB && SMP diff --git a/mm/Makefile b/mm/Makefile index d210cc9d6f80..ac5e5ba78874 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -33,7 +33,7 @@ mmu-$(CONFIG_MMU) += process_vm_access.o endif obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ - maccess.o page_alloc.o page-writeback.o \ + maccess.o page-writeback.o \ readahead.o swap.o truncate.o vmscan.o shmem.o \ util.o mmzone.o vmstat.o backing-dev.o \ mm_init.o mmu_context.o percpu.o slab_common.o \ @@ -41,6 +41,11 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ interval_tree.o list_lru.o workingset.o \ debug.o $(mmu-y) +# Give 'page_alloc' its own module-parameter namespace +page-alloc-y := page_alloc.o +page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o + +obj-y += page-alloc.o obj-y += init-mm.o obj-y += memblock.o diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index b9a667d36c55..0bb10e132f95 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -39,6 +39,7 @@ #include #include "internal.h" +#include "shuffle.h" /* * online_page_callback contains pointer to current page onlining function. @@ -895,6 +896,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ zone->zone_pgdat->node_present_pages += onlined_pages; pgdat_resize_unlock(zone->zone_pgdat, &flags); + shuffle_zone(zone); + if (onlined_pages) { node_states_set_node(nid, &arg); if (need_zonelists_rebuild) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index cde5dac6229a..14ef39445544 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -72,6 +72,7 @@ #include #include #include "internal.h" +#include "shuffle.h" /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ static DEFINE_MUTEX(pcp_batch_high_lock); @@ -1752,9 +1753,9 @@ _deferred_grow_zone(struct zone *zone, unsigned int order) void __init page_alloc_init_late(void) { struct zone *zone; + int nid; #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT - int nid; /* There will be num_node_state(N_MEMORY) threads */ atomic_set(&pgdat_init_n_undone, num_node_state(N_MEMORY)); @@ -1779,6 +1780,9 @@ void __init page_alloc_init_late(void) memblock_discard(); #endif + for_each_node_state(nid, N_MEMORY) + shuffle_free_memory(NODE_DATA(nid)); + for_each_populated_zone(zone) set_zone_contiguous(zone); } diff --git a/mm/shuffle.c b/mm/shuffle.c new file mode 100644 index 000000000000..8badf4f0a852 --- /dev/null +++ b/mm/shuffle.c @@ -0,0 +1,170 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright(c) 2018 Intel Corporation. All rights reserved. + +#include +#include +#include +#include +#include +#include "internal.h" +#include "shuffle.h" + +DEFINE_STATIC_KEY_FALSE(page_alloc_shuffle_key); +static unsigned long shuffle_state __ro_after_init; + +/* + * Depending on the architecture, module parameter parsing may run + * before, or after the cache detection. SHUFFLE_FORCE_DISABLE prevents, + * or reverts the enabling of the shuffle implementation. SHUFFLE_ENABLE + * attempts to turn on the implementation, but aborts if it finds + * SHUFFLE_FORCE_DISABLE already set. + */ +__meminit void page_alloc_shuffle(enum mm_shuffle_ctl ctl) +{ + if (ctl == SHUFFLE_FORCE_DISABLE) + set_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state); + + if (test_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state)) { + if (test_and_clear_bit(SHUFFLE_ENABLE, &shuffle_state)) + static_branch_disable(&page_alloc_shuffle_key); + } else if (ctl == SHUFFLE_ENABLE + && !test_and_set_bit(SHUFFLE_ENABLE, &shuffle_state)) + static_branch_enable(&page_alloc_shuffle_key); +} + +static bool shuffle_param; +extern int shuffle_show(char *buffer, const struct kernel_param *kp) +{ + return sprintf(buffer, "%c\n", test_bit(SHUFFLE_ENABLE, &shuffle_state) + ? 'Y' : 'N'); +} +module_param_call(shuffle, NULL, shuffle_show, &shuffle_param, 0400); + +/* + * For two pages to be swapped in the shuffle, they must be free (on a + * 'free_area' lru), have the same order, and have the same migratetype. + */ +static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order) +{ + struct page *page; + + /* + * Given we're dealing with randomly selected pfns in a zone we + * need to ask questions like... + */ + + /* ...is the pfn even in the memmap? */ + if (!pfn_valid_within(pfn)) + return NULL; + + /* ...is the pfn in a present section or a hole? */ + if (!pfn_present(pfn)) + return NULL; + + /* ...is the page free and currently on a free_area list? */ + page = pfn_to_page(pfn); + if (!PageBuddy(page)) + return NULL; + + /* + * ...is the page on the same list as the page we will + * shuffle it with? + */ + if (page_order(page) != order) + return NULL; + + return page; +} + +/* + * Fisher-Yates shuffle the freelist which prescribes iterating through an + * array, pfns in this case, and randomly swapping each entry with another in + * the span, end_pfn - start_pfn. + * + * To keep the implementation simple it does not attempt to correct for sources + * of bias in the distribution, like modulo bias or pseudo-random number + * generator bias. I.e. the expectation is that this shuffling raises the bar + * for attacks that exploit the predictability of page allocations, but need not + * be a perfect shuffle. + */ +#define SHUFFLE_RETRY 10 +void __meminit __shuffle_zone(struct zone *z) +{ + unsigned long i, flags; + unsigned long start_pfn = z->zone_start_pfn; + unsigned long end_pfn = zone_end_pfn(z); + const int order = SHUFFLE_ORDER; + const int order_pages = 1 << order; + + spin_lock_irqsave(&z->lock, flags); + start_pfn = ALIGN(start_pfn, order_pages); + for (i = start_pfn; i < end_pfn; i += order_pages) { + unsigned long j; + int migratetype, retry; + struct page *page_i, *page_j; + + /* + * We expect page_i, in the sub-range of a zone being added + * (@start_pfn to @end_pfn), to more likely be valid compared to + * page_j randomly selected in the span @zone_start_pfn to + * @spanned_pages. + */ + page_i = shuffle_valid_page(i, order); + if (!page_i) + continue; + + for (retry = 0; retry < SHUFFLE_RETRY; retry++) { + /* + * Pick a random order aligned page in the zone span as + * a swap target. If the selected pfn is a hole, retry + * up to SHUFFLE_RETRY attempts find a random valid pfn + * in the zone. + */ + j = z->zone_start_pfn + + ALIGN_DOWN(get_random_long() % z->spanned_pages, + order_pages); + page_j = shuffle_valid_page(j, order); + if (page_j && page_j != page_i) + break; + } + if (retry >= SHUFFLE_RETRY) { + pr_debug("%s: failed to swap %#lx\n", __func__, i); + continue; + } + + /* + * Each migratetype corresponds to its own list, make sure the + * types match otherwise we're moving pages to lists where they + * do not belong. + */ + migratetype = get_pageblock_migratetype(page_i); + if (get_pageblock_migratetype(page_j) != migratetype) { + pr_debug("%s: migratetype mismatch %#lx\n", __func__, i); + continue; + } + + list_swap(&page_i->lru, &page_j->lru); + + pr_debug("%s: swap: %#lx -> %#lx\n", __func__, i, j); + + /* take it easy on the zone lock */ + if ((i % (100 * order_pages)) == 0) { + spin_unlock_irqrestore(&z->lock, flags); + cond_resched(); + spin_lock_irqsave(&z->lock, flags); + } + } + spin_unlock_irqrestore(&z->lock, flags); +} + +/** + * shuffle_free_memory - reduce the predictability of the page allocator + * @pgdat: node page data + */ +void __meminit __shuffle_free_memory(pg_data_t *pgdat) +{ + struct zone *z; + + for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++) + shuffle_zone(z); +} diff --git a/mm/shuffle.h b/mm/shuffle.h new file mode 100644 index 000000000000..644c8ee97b9e --- /dev/null +++ b/mm/shuffle.h @@ -0,0 +1,52 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright(c) 2018 Intel Corporation. All rights reserved. +#ifndef _MM_SHUFFLE_H +#define _MM_SHUFFLE_H +#include + +/* + * SHUFFLE_ENABLE is called from the command line enabling path, or by + * platform-firmware enabling that indicates the presence of a + * direct-mapped memory-side-cache. SHUFFLE_FORCE_DISABLE is called from + * the command line path and overrides any previous or future + * SHUFFLE_ENABLE. + */ +enum mm_shuffle_ctl { + SHUFFLE_ENABLE, + SHUFFLE_FORCE_DISABLE, +}; + +#define SHUFFLE_ORDER (MAX_ORDER-1) + +#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR +DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key); +extern void page_alloc_shuffle(enum mm_shuffle_ctl ctl); +extern void __shuffle_free_memory(pg_data_t *pgdat); +static inline void shuffle_free_memory(pg_data_t *pgdat) +{ + if (!static_branch_unlikely(&page_alloc_shuffle_key)) + return; + __shuffle_free_memory(pgdat); +} + +extern void __shuffle_zone(struct zone *z); +static inline void shuffle_zone(struct zone *z) +{ + if (!static_branch_unlikely(&page_alloc_shuffle_key)) + return; + __shuffle_zone(z); +} +#else +static inline void shuffle_free_memory(pg_data_t *pgdat) +{ +} + +static inline void shuffle_zone(struct zone *z) +{ +} + +static inline void page_alloc_shuffle(enum mm_shuffle_ctl ctl) +{ +} +#endif +#endif /* _MM_SHUFFLE_H */ From patchwork Fri Feb 1 05:15:22 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dan Williams X-Patchwork-Id: 10791801 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E573C13B5 for ; Fri, 1 Feb 2019 05:28:04 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D17983190A for ; Fri, 1 Feb 2019 05:28:04 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id C2D113190C; Fri, 1 Feb 2019 05:28:04 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1 Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id C2B1C3190A for ; Fri, 1 Feb 2019 05:28:03 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 6CB498E0004; Fri, 1 Feb 2019 00:28:02 -0500 (EST) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 67ACD8E0001; Fri, 1 Feb 2019 00:28:02 -0500 (EST) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 591918E0004; Fri, 1 Feb 2019 00:28:02 -0500 (EST) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from mail-pg1-f199.google.com (mail-pg1-f199.google.com [209.85.215.199]) by kanga.kvack.org (Postfix) with ESMTP id 17E1A8E0001 for ; Fri, 1 Feb 2019 00:28:02 -0500 (EST) Received: by mail-pg1-f199.google.com with SMTP id p4so3862772pgj.21 for ; Thu, 31 Jan 2019 21:28:02 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-original-authentication-results:x-gm-message-state:subject:from :to:cc:date:message-id:in-reply-to:references:user-agent :mime-version:content-transfer-encoding; bh=ek6GtjcmVUDWbKs+WnJ5MRWRCQt54S5YJhX4TC0I7Dc=; b=VuDJUR7t/X+Z2SqCAXiDOnMc+FSVcSge63EJ+1ZvV8gtO4h8muNKeXIHOx5Dm16RIW omlTHr45n90Qpobxu+njVKNm4utZ5oH/zzoMHHUXw+7ua7Bd0RkvS0foQxktbn08JdIG /t4UI5olWNbitRPReNbMLQHbi9IhC0KoI3pxxxnbaenSR9X5jMlPojXT6UYxg2mhqVf9 LZm7smtBgNHmwzjuy0SChoXU6ZX+CPb4x57g5Q8yL0TGmWe94EDC7EW+VzQPQolE3GEt jBJmkkyfVoX/AozlVZrdByoYiYyfWlgmJIoIso9B2ySUf2JEsS7khgZ4QO9tEDOAMaFL 8Lsg== X-Original-Authentication-Results: mx.google.com; spf=pass (google.com: domain of dan.j.williams@intel.com designates 134.134.136.126 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com X-Gm-Message-State: AHQUAuaeBSQ3+Z/zl0M/XxFXJoqLPJiO1KILko91Eh9KEtxl9ZWrtWgw dEiP6DLrtM6ZHyVHX+XCzbMgLtxZ+QSbRC0LHYgPMpHNjBDsoMEaQsNnzTXuELqqj0wT8DYwYXO dVpe/7SBuyR+/w8vQInToTPwUEJtMe0oHvXqf1CpaW4mLjQhThkczHKNgRRQxNMKtTw== X-Received: by 2002:a62:8a57:: with SMTP id y84mr6877306pfd.197.1548998881719; Thu, 31 Jan 2019 21:28:01 -0800 (PST) X-Google-Smtp-Source: AHgI3IYCuBUjuYoNiFgx/NfS01B0/gLvmSzB86BuA53U8VlvdyTQnfEs3+WNavBadX0uxy7PbWon X-Received: by 2002:a62:8a57:: with SMTP id y84mr6877268pfd.197.1548998880628; Thu, 31 Jan 2019 21:28:00 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1548998880; cv=none; d=google.com; s=arc-20160816; b=kpMk4T4RSnsaa/PhD13sXOqFb8W/is8rACmAE4vAfbFuZ+jePhGs9IFNIewkUuDtT4 Jx9GhwQEQcpEa3UuUsRY/DlYw+kjt15Pkaf7Lo0wiACjX7/ET7FCAChK7Sk+xPljRJQi RYkMLfTgfwOdbe43+g7DevzjLQgu5tbDpC33YtuKfN5im5jtFdsCv9mOIRy7fxSzDndX HnYVdWVH0egL04WsxxeoPjOCfBbUEFgy8KOWrEUMFzB7gl2H6jFBJ2h6v7Op8B2TsJgv 4L8/U6Ws7sgkYOwanfsJvWhTeRzOma92TeORFzmmExElKSUpoPX2Js9ysudZYeOzoCF7 Q7OA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:mime-version:user-agent:references :in-reply-to:message-id:date:cc:to:from:subject; bh=ek6GtjcmVUDWbKs+WnJ5MRWRCQt54S5YJhX4TC0I7Dc=; b=uhUMdZ+sJJQznr3Bj1MoMDa+tPzefvYC+j5n/Cl0n3kkBLmP1vaVys/QA8y74lxRHR BHuUiPMN4qX1lTDBlB/DHKNKFMPpHclH8XRZj8sZh8ODcxzloyFCDo8yqEn/vTCkJahp C2i6YLS4RuV7PWmlrA4hHxhK4zBn5N+/dPQQBbnxRRWZnUSA/n8HDku+ve+zCBMIxeUP Vmaaa7PsUxdMIW8pgktAW0BaQJFKnMBno2fezO+n7G1/U5p5wF+e9pCUVmX8M+YwDNzK INSzwOHa6eu2ZjjcvY3byjwT7XnTWDtKb6nHJ5LkTMD7r5aAKW02TwXZlckhVByAprT4 FkPw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of dan.j.williams@intel.com designates 134.134.136.126 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from mga18.intel.com (mga18.intel.com. [134.134.136.126]) by mx.google.com with ESMTPS id i18si1986542pgl.414.2019.01.31.21.28.00 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 31 Jan 2019 21:28:00 -0800 (PST) Received-SPF: pass (google.com: domain of dan.j.williams@intel.com designates 134.134.136.126 as permitted sender) client-ip=134.134.136.126; Authentication-Results: mx.google.com; spf=pass (google.com: domain of dan.j.williams@intel.com designates 134.134.136.126 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga006.jf.intel.com ([10.7.209.51]) by orsmga106.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 31 Jan 2019 21:28:00 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.56,547,1539673200"; d="scan'208";a="112832716" Received: from dwillia2-desk3.jf.intel.com (HELO dwillia2-desk3.amr.corp.intel.com) ([10.54.39.16]) by orsmga006.jf.intel.com with ESMTP; 31 Jan 2019 21:28:00 -0800 Subject: [PATCH v10 2/3] mm: Move buddy list manipulations into helpers From: Dan Williams To: akpm@linux-foundation.org Cc: Michal Hocko , Dave Hansen , linux-mm@kvack.org, linux-kernel@vger.kernel.org, keith.busch@intel.com Date: Thu, 31 Jan 2019 21:15:22 -0800 Message-ID: <154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.com> In-Reply-To: <154899811208.3165233.17623209031065121886.stgit@dwillia2-desk3.amr.corp.intel.com> References: <154899811208.3165233.17623209031065121886.stgit@dwillia2-desk3.amr.corp.intel.com> User-Agent: StGit/0.18-2-gc94f MIME-Version: 1.0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: X-Virus-Scanned: ClamAV using ClamSMTP In preparation for runtime randomization of the zone lists, take all (well, most of) the list_*() functions in the buddy allocator and put them in helper functions. Provide a common control point for injecting additional behavior when freeing pages. Acked-by: Michal Hocko Cc: Dave Hansen Signed-off-by: Dan Williams Signed-off-by: Vlastimil Babka Acked-by: Dan Williams --- include/linux/mm.h | 3 -- include/linux/mm_types.h | 3 ++ include/linux/mmzone.h | 46 ++++++++++++++++++++++++++++++ mm/compaction.c | 4 +-- mm/page_alloc.c | 70 ++++++++++++++++++---------------------------- 5 files changed, 79 insertions(+), 47 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 80bb6408fe73..1621acd10f83 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -500,9 +500,6 @@ static inline void vma_set_anonymous(struct vm_area_struct *vma) struct mmu_gather; struct inode; -#define page_private(page) ((page)->private) -#define set_page_private(page, v) ((page)->private = (v)) - #if !defined(__HAVE_ARCH_PTE_DEVMAP) || !defined(CONFIG_TRANSPARENT_HUGEPAGE) static inline int pmd_devmap(pmd_t pmd) { diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 2c471a2c43fa..1c7dc7ffa288 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -214,6 +214,9 @@ struct page { #define PAGE_FRAG_CACHE_MAX_SIZE __ALIGN_MASK(32768, ~PAGE_MASK) #define PAGE_FRAG_CACHE_MAX_ORDER get_order(PAGE_FRAG_CACHE_MAX_SIZE) +#define page_private(page) ((page)->private) +#define set_page_private(page, v) ((page)->private = (v)) + struct page_frag_cache { void * va; #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index c151c87a728a..2274e43933ae 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -18,6 +18,8 @@ #include #include #include +#include +#include #include /* Free memory management - zoned buddy allocator. */ @@ -98,6 +100,50 @@ struct free_area { unsigned long nr_free; }; +/* Used for pages not on another list */ +static inline void add_to_free_area(struct page *page, struct free_area *area, + int migratetype) +{ + list_add(&page->lru, &area->free_list[migratetype]); + area->nr_free++; +} + +/* Used for pages not on another list */ +static inline void add_to_free_area_tail(struct page *page, struct free_area *area, + int migratetype) +{ + list_add_tail(&page->lru, &area->free_list[migratetype]); + area->nr_free++; +} + +/* Used for pages which are on another list */ +static inline void move_to_free_area(struct page *page, struct free_area *area, + int migratetype) +{ + list_move(&page->lru, &area->free_list[migratetype]); +} + +static inline struct page *get_page_from_free_area(struct free_area *area, + int migratetype) +{ + return list_first_entry_or_null(&area->free_list[migratetype], + struct page, lru); +} + +static inline void del_page_from_free_area(struct page *page, + struct free_area *area, int migratetype) +{ + list_del(&page->lru); + __ClearPageBuddy(page); + set_page_private(page, 0); + area->nr_free--; +} + +static inline bool free_area_empty(struct free_area *area, int migratetype) +{ + return list_empty(&area->free_list[migratetype]); +} + struct pglist_data; /* diff --git a/mm/compaction.c b/mm/compaction.c index ef29490b0f46..a22ac7ab65c5 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1359,13 +1359,13 @@ static enum compact_result __compact_finished(struct zone *zone, bool can_steal; /* Job done if page is free of the right migratetype */ - if (!list_empty(&area->free_list[migratetype])) + if (!free_area_empty(area, migratetype)) return COMPACT_SUCCESS; #ifdef CONFIG_CMA /* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */ if (migratetype == MIGRATE_MOVABLE && - !list_empty(&area->free_list[MIGRATE_CMA])) + !free_area_empty(area, MIGRATE_CMA)) return COMPACT_SUCCESS; #endif /* diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 14ef39445544..3fd0df403766 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -743,12 +743,6 @@ static inline void set_page_order(struct page *page, unsigned int order) __SetPageBuddy(page); } -static inline void rmv_page_order(struct page *page) -{ - __ClearPageBuddy(page); - set_page_private(page, 0); -} - /* * This function checks whether a page is free && is the buddy * we can coalesce a page and its buddy if @@ -849,13 +843,11 @@ static inline void __free_one_page(struct page *page, * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page, * merge with it and move up one order. */ - if (page_is_guard(buddy)) { + if (page_is_guard(buddy)) clear_page_guard(zone, buddy, order, migratetype); - } else { - list_del(&buddy->lru); - zone->free_area[order].nr_free--; - rmv_page_order(buddy); - } + else + del_page_from_free_area(buddy, &zone->free_area[order], + migratetype); combined_pfn = buddy_pfn & pfn; page = page + (combined_pfn - pfn); pfn = combined_pfn; @@ -905,15 +897,13 @@ static inline void __free_one_page(struct page *page, higher_buddy = higher_page + (buddy_pfn - combined_pfn); if (pfn_valid_within(buddy_pfn) && page_is_buddy(higher_page, higher_buddy, order + 1)) { - list_add_tail(&page->lru, - &zone->free_area[order].free_list[migratetype]); - goto out; + add_to_free_area_tail(page, &zone->free_area[order], + migratetype); + return; } } - list_add(&page->lru, &zone->free_area[order].free_list[migratetype]); -out: - zone->free_area[order].nr_free++; + add_to_free_area(page, &zone->free_area[order], migratetype); } /* @@ -1853,7 +1843,7 @@ static inline void expand(struct zone *zone, struct page *page, if (set_page_guard(zone, &page[size], high, migratetype)) continue; - list_add(&page[size].lru, &area->free_list[migratetype]); + add_to_free_area(&page[size], area, migratetype); area->nr_free++; set_page_order(&page[size], high); } @@ -1995,13 +1985,10 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, /* Find a page of the appropriate size in the preferred list */ for (current_order = order; current_order < MAX_ORDER; ++current_order) { area = &(zone->free_area[current_order]); - page = list_first_entry_or_null(&area->free_list[migratetype], - struct page, lru); + page = get_page_from_free_area(area, migratetype); if (!page) continue; - list_del(&page->lru); - rmv_page_order(page); - area->nr_free--; + del_page_from_free_area(page, area, migratetype); expand(zone, page, order, current_order, area, migratetype); set_pcppage_migratetype(page, migratetype); return page; @@ -2087,8 +2074,7 @@ static int move_freepages(struct zone *zone, } order = page_order(page); - list_move(&page->lru, - &zone->free_area[order].free_list[migratetype]); + move_to_free_area(page, &zone->free_area[order], migratetype); page += 1 << order; pages_moved += 1 << order; } @@ -2264,7 +2250,7 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page, single_page: area = &zone->free_area[current_order]; - list_move(&page->lru, &area->free_list[start_type]); + move_to_free_area(page, area, start_type); } /* @@ -2288,7 +2274,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order, if (fallback_mt == MIGRATE_TYPES) break; - if (list_empty(&area->free_list[fallback_mt])) + if (free_area_empty(area, fallback_mt)) continue; if (can_steal_fallback(order, migratetype)) @@ -2375,9 +2361,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, for (order = 0; order < MAX_ORDER; order++) { struct free_area *area = &(zone->free_area[order]); - page = list_first_entry_or_null( - &area->free_list[MIGRATE_HIGHATOMIC], - struct page, lru); + page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC); if (!page) continue; @@ -2500,8 +2484,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, VM_BUG_ON(current_order == MAX_ORDER); do_steal: - page = list_first_entry(&area->free_list[fallback_mt], - struct page, lru); + page = get_page_from_free_area(area, fallback_mt); steal_suitable_fallback(zone, page, alloc_flags, start_migratetype, can_steal); @@ -2938,6 +2921,7 @@ EXPORT_SYMBOL_GPL(split_page); int __isolate_free_page(struct page *page, unsigned int order) { + struct free_area *area = &page_zone(page)->free_area[order]; unsigned long watermark; struct zone *zone; int mt; @@ -2962,9 +2946,8 @@ int __isolate_free_page(struct page *page, unsigned int order) } /* Remove page from free list */ - list_del(&page->lru); - zone->free_area[order].nr_free--; - rmv_page_order(page); + + del_page_from_free_area(page, area, mt); /* * Set the pageblock if the isolated page is at least half of a @@ -3266,13 +3249,13 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, continue; for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) { - if (!list_empty(&area->free_list[mt])) + if (!free_area_empty(area, mt)) return true; } #ifdef CONFIG_CMA if ((alloc_flags & ALLOC_CMA) && - !list_empty(&area->free_list[MIGRATE_CMA])) { + !free_area_empty(area, MIGRATE_CMA)) { return true; } #endif @@ -5174,7 +5157,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) types[order] = 0; for (type = 0; type < MIGRATE_TYPES; type++) { - if (!list_empty(&area->free_list[type])) + if (!free_area_empty(area, type)) types[order] |= 1 << type; } } @@ -8319,6 +8302,9 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) spin_lock_irqsave(&zone->lock, flags); pfn = start_pfn; while (pfn < end_pfn) { + struct free_area *area; + int mt; + if (!pfn_valid(pfn)) { pfn++; continue; @@ -8337,13 +8323,13 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) BUG_ON(page_count(page)); BUG_ON(!PageBuddy(page)); order = page_order(page); + area = &zone->free_area[order]; #ifdef CONFIG_DEBUG_VM pr_info("remove from free list %lx %d %lx\n", pfn, 1 << order, end_pfn); #endif - list_del(&page->lru); - rmv_page_order(page); - zone->free_area[order].nr_free--; + mt = get_pageblock_migratetype(page); + del_page_from_free_area(page, area, mt); for (i = 0; i < (1 << order); i++) SetPageReserved((page+i)); pfn += (1 << order); From patchwork Fri Feb 1 05:15:27 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dan Williams X-Patchwork-Id: 10791803 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id D116413B5 for ; Fri, 1 Feb 2019 05:28:09 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id C02B73190A for ; Fri, 1 Feb 2019 05:28:09 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id B3E3C3190C; Fri, 1 Feb 2019 05:28:09 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1 Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 1E8AF3190A for ; Fri, 1 Feb 2019 05:28:09 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id E44CA8E0005; Fri, 1 Feb 2019 00:28:07 -0500 (EST) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id DF4638E0001; Fri, 1 Feb 2019 00:28:07 -0500 (EST) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id CE5918E0005; Fri, 1 Feb 2019 00:28:07 -0500 (EST) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from mail-pl1-f199.google.com (mail-pl1-f199.google.com [209.85.214.199]) by kanga.kvack.org (Postfix) with ESMTP id 8EC578E0001 for ; Fri, 1 Feb 2019 00:28:07 -0500 (EST) Received: by mail-pl1-f199.google.com with SMTP id 12so4224722plb.18 for ; Thu, 31 Jan 2019 21:28:07 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-original-authentication-results:x-gm-message-state:subject:from :to:cc:date:message-id:in-reply-to:references:user-agent :mime-version:content-transfer-encoding; bh=203f4hHw7wpu5c0wkT/XNvZ8JNYa2RhshYOJCFq6V7M=; b=UkfVrmDb65zpC3YSH4s/LzBjJMtBaeehwbU8XpoU/esZW8IhO5Xmrx6a/jDe6ajjMq r5i6el0Y5riPzlzMQK8dFJPCtq/aQ0CsupimC9ts+KBmoNt1a3PrOl/6G+ksp3yzZTtk hKkoZOiTzHJHQfDJPPWg0Cex2wMLvlBnygVWAvC0RCvwl0lYXl5IOszYRhzqyB9diJ2i YoyC8fdSLqHMYdRxntJLzpE/Jw7AqHhehHh6Uvt3hjzxHu9gzp+NlBx4GsQ+QKPueESm q3JyxcWFgMYFufMGzcjIdpgKKmwHCSdmQmsb0Rrz/qL5ffh3VQ4wcbKuvh/dGOwirJpK ivMQ== X-Original-Authentication-Results: mx.google.com; spf=pass (google.com: domain of dan.j.williams@intel.com designates 192.55.52.120 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com X-Gm-Message-State: AJcUukd+xB2JA/ZEh3dTEs2S8BKo7d6n7IBJSDgxjqlxxcLhLN6t8pcj SJm2VI3lfsSBJLonfFFrFTB6KzJHtrc5ol0g3J6jAfDoQXqhTd7MR+emzIynu6j5wJ7Dvlzj13g NRfIObMScheR8OaYpNA4sG/tvRAOzVXw2yzh4tJM2ZAuLFEeyLpzMgfSFWJ8hm+7vKg== X-Received: by 2002:a62:ca05:: with SMTP id n5mr37992805pfg.154.1548998887245; Thu, 31 Jan 2019 21:28:07 -0800 (PST) X-Google-Smtp-Source: ALg8bN72/qxRdPa98A9wiszHr/CH6NEuKKF+X2SqoR2oMFkVtQ0l2XR92YYsxEsueIn9/cl1JhGU X-Received: by 2002:a62:ca05:: with SMTP id n5mr37992767pfg.154.1548998886334; Thu, 31 Jan 2019 21:28:06 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1548998886; cv=none; d=google.com; s=arc-20160816; b=V40arGupyTT0hVmCq+pQ/6+/QOp1bqOuPhuBO0Z3QMm4/3YRuAK3LlDtd3YgxmsBji N5FNZiZTj7qcD1bvSpNT/aywRgXLZ5mY36k/jadqbrvapYyY8qcl5VFrykkcmGdgMr8S tlpqvm1h9icCXbeL7P/CBo1gZMEefpajwrgzVJjM18HG1e3x8PQj3ooq3hi1+wkAP2Gm fuAmltGJEWbXkmSUQdZrO0bPRyiCKdU9S2j7MVzW1dQp47g0gRJxuj6DxVaUjhsdzB1h 0kxDxvO8PjI87jHjNfWuu9WeaApSGo18FuADwd1zHzk/zWJudv9ZFLnYgW2fW9dCC6My 4k/g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:mime-version:user-agent:references :in-reply-to:message-id:date:cc:to:from:subject; bh=203f4hHw7wpu5c0wkT/XNvZ8JNYa2RhshYOJCFq6V7M=; b=vpLKQSCaVWMAug/1GVbjeVwsx+DPgWT7BfM66yVqb3+6EZxuH6ogfJ8SQouBV+LaVT AlrSu7Xl7C3R/E1ZrW6vpMJsNqAIRAFzNUjfHwyt4t9YAV6VowvJBS9B1JKzwViW6Z4m T9nZPODwhXdvYXxx29UFkO1/KJXeg6jxtsy75po1R9/TdWfbGzCazSoTLGYvPnM5PLMK KrXkgYRDPO313dYVFlmiwmlWG4wVvEte7tldzNOuJ5ZVY3WhOZCUjIMC4fEOMwurr822 BD1n16GMgz7vMMVBtVdpuT0xb9ByDqNyQl+s55MnCwljkBFu5iaZyVXJjj6g6oLGs9kp AvMw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of dan.j.williams@intel.com designates 192.55.52.120 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from mga04.intel.com (mga04.intel.com. [192.55.52.120]) by mx.google.com with ESMTPS id m10si6191393plt.295.2019.01.31.21.28.06 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 31 Jan 2019 21:28:06 -0800 (PST) Received-SPF: pass (google.com: domain of dan.j.williams@intel.com designates 192.55.52.120 as permitted sender) client-ip=192.55.52.120; Authentication-Results: mx.google.com; spf=pass (google.com: domain of dan.j.williams@intel.com designates 192.55.52.120 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga104.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 31 Jan 2019 21:28:05 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.56,547,1539673200"; d="scan'208";a="114410573" Received: from dwillia2-desk3.jf.intel.com (HELO dwillia2-desk3.amr.corp.intel.com) ([10.54.39.16]) by orsmga008.jf.intel.com with ESMTP; 31 Jan 2019 21:28:05 -0800 Subject: [PATCH v10 3/3] mm: Maintain randomization of page free lists From: Dan Williams To: akpm@linux-foundation.org Cc: Michal Hocko , Dave Hansen , Kees Cook , linux-mm@kvack.org, linux-kernel@vger.kernel.org, keith.busch@intel.com Date: Thu, 31 Jan 2019 21:15:27 -0800 Message-ID: <154899812788.3165233.9066631950746578517.stgit@dwillia2-desk3.amr.corp.intel.com> In-Reply-To: <154899811208.3165233.17623209031065121886.stgit@dwillia2-desk3.amr.corp.intel.com> References: <154899811208.3165233.17623209031065121886.stgit@dwillia2-desk3.amr.corp.intel.com> User-Agent: StGit/0.18-2-gc94f MIME-Version: 1.0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: X-Virus-Scanned: ClamAV using ClamSMTP When freeing a page with an order >= shuffle_page_order randomly select the front or back of the list for insertion. While the mm tries to defragment physical pages into huge pages this can tend to make the page allocator more predictable over time. Inject the front-back randomness to preserve the initial randomness established by shuffle_free_memory() when the kernel was booted. The overhead of this manipulation is constrained by only being applied for MAX_ORDER sized pages by default. Cc: Michal Hocko Cc: Dave Hansen Reviewed-by: Kees Cook Signed-off-by: Dan Williams Acked-by: Michal Hocko --- include/linux/mmzone.h | 12 ++++++++++++ mm/page_alloc.c | 11 +++++++++-- mm/shuffle.c | 23 +++++++++++++++++++++++ mm/shuffle.h | 12 ++++++++++++ 4 files changed, 56 insertions(+), 2 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 2274e43933ae..a3cb9a21196d 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -116,6 +116,18 @@ static inline void add_to_free_area_tail(struct page *page, struct free_area *ar area->nr_free++; } +#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR +/* Used to preserve page allocation order entropy */ +void add_to_free_area_random(struct page *page, struct free_area *area, + int migratetype); +#else +static inline void add_to_free_area_random(struct page *page, + struct free_area *area, int migratetype) +{ + add_to_free_area(page, area, migratetype); +} +#endif + /* Used for pages which are on another list */ static inline void move_to_free_area(struct page *page, struct free_area *area, int migratetype) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3fd0df403766..2a0969e3b0eb 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -43,6 +43,7 @@ #include #include #include +#include #include #include #include @@ -889,7 +890,8 @@ static inline void __free_one_page(struct page *page, * so it's less likely to be used soon and more likely to be merged * as a higher order page */ - if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn)) { + if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn) + && !is_shuffle_order(order)) { struct page *higher_page, *higher_buddy; combined_pfn = buddy_pfn & pfn; higher_page = page + (combined_pfn - pfn); @@ -903,7 +905,12 @@ static inline void __free_one_page(struct page *page, } } - add_to_free_area(page, &zone->free_area[order], migratetype); + if (is_shuffle_order(order)) + add_to_free_area_random(page, &zone->free_area[order], + migratetype); + else + add_to_free_area(page, &zone->free_area[order], migratetype); + } /* diff --git a/mm/shuffle.c b/mm/shuffle.c index 8badf4f0a852..19bbf3e37fb6 100644 --- a/mm/shuffle.c +++ b/mm/shuffle.c @@ -168,3 +168,26 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat) for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++) shuffle_zone(z); } + +void add_to_free_area_random(struct page *page, struct free_area *area, + int migratetype) +{ + static u64 rand; + static u8 rand_bits; + + /* + * The lack of locking is deliberate. If 2 threads race to + * update the rand state it just adds to the entropy. + */ + if (rand_bits == 0) { + rand_bits = 64; + rand = get_random_u64(); + } + + if (rand & 1) + add_to_free_area(page, area, migratetype); + else + add_to_free_area_tail(page, area, migratetype); + rand_bits--; + rand >>= 1; +} diff --git a/mm/shuffle.h b/mm/shuffle.h index 644c8ee97b9e..fc1e327ae22d 100644 --- a/mm/shuffle.h +++ b/mm/shuffle.h @@ -36,6 +36,13 @@ static inline void shuffle_zone(struct zone *z) return; __shuffle_zone(z); } + +static inline bool is_shuffle_order(int order) +{ + if (!static_branch_unlikely(&page_alloc_shuffle_key)) + return false; + return order >= SHUFFLE_ORDER; +} #else static inline void shuffle_free_memory(pg_data_t *pgdat) { @@ -48,5 +55,10 @@ static inline void shuffle_zone(struct zone *z) static inline void page_alloc_shuffle(enum mm_shuffle_ctl ctl) { } + +static inline bool is_shuffle_order(int order) +{ + return false; +} #endif #endif /* _MM_SHUFFLE_H */