diff mbox series

[v9,1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization

Message ID 154882453604.1338686.15108059741397800728.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State New, archived
Headers show
Series mm: Randomize free memory | expand

Commit Message

Dan Williams Jan. 30, 2019, 5:02 a.m. UTC
Randomization of the page allocator improves the average utilization of
a direct-mapped memory-side-cache. Memory side caching is a platform
capability that Linux has previously been exposed to in HPC
(high-performance computing) environments on specialty platforms. In
that instance it was a smaller pool of high-bandwidth memory relative to
higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
be found on general purpose server platforms where DRAM is a cache in
front of higher latency persistent memory [1].

Robert offered an explanation of the state of the art of Linux
interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    also reduces the chance that the buffers will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel).  That's better than forcing
    users to deploy remedies like:
        "To eliminate this gradual degradation, we have added a Stream
         measurement to the Node Health Check that follows each job;
         nodes are rebooted whenever their measured memory bandwidth
         falls below 300 GB/s."

A replacement for zonesort was merged upstream in commit cc9aec03e58f
"x86/numa_emulation: Introduce uniform split capability". With this
numa_emulation capability, memory can be split into cache-sized
("near-memory" sized) numa nodes. Binding a workload to such a node,
and disabling workloads on other nodes, enables full cache performance.
However, once the workload exceeds the cache size, cache conflicts are
unavoidable. While HPC environments might be able to tolerate
time-scheduling of cache-sized workloads, for general purpose server
platforms the oversubscribed cache case will be the common case.
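
As an illustration of the workflow described above (the uniform-split
syntax is from commit cc9aec03e58f; the node count, node id, and
workload name are placeholders):

    # boot with memory split into cache-sized emulated nodes
    numa=fake=8U

    # then pin a workload's cpus and memory to one near-memory sized node
    numactl --cpunodebind=1 --membind=1 ./workload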

The worst case scenario is that a server system owner benchmarks a
workload at boot with an un-contended cache, only to see performance
degrade over time, even below the average cache performance, due to
excessive conflicts. Randomization clips the peaks and fills in the
valleys of cache utilization to yield steady average performance.

Here are some performance impact details of the patches:

1/ An Intel internal synthetic memory bandwidth measurement tool saw a
3X speedup in a contrived case that tries to force cache conflicts. The
contrived case used the numa_emulation capability to force an instance
of the benchmark to be run in two of the near-memory sized numa nodes.
If both instances were placed on the same emulated node they would fit
and cause zero conflicts. On separate emulated nodes without
randomization they underutilized the cache and conflicted unnecessarily
due to the in-order allocation per node.

2/ A well known Java server application benchmark was run with a heap
size that exceeded cache size by 3X. The cache conflict rate was 8% for
the first run and degraded to 21% after page allocator aging. With
randomization enabled the rate levelled out at 11%.

3/ A MongoDB workload did not observe a measurable difference in
cache-conflict rates, but the overall throughput dropped by 7% with
randomization in one case.

4/ Mel Gorman ran his suite of performance workloads with randomization
enabled on platforms without a memory-side-cache and saw a mix of some
improvements and some losses [3].

While there is potentially significant improvement for applications that
depend on low latency access across a wide working-set, the impact may
be negligible or negative for other workloads. For this reason the
shuffle capability defaults to off unless a direct-mapped
memory-side-cache is detected. Even then, the page_alloc.shuffle=0
parameter can be specified to disable the randomization on those
systems.
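
Concretely, the knob looks like this (the sysfs path follows the usual
/sys/module/<name>/parameters convention for the module_param_call() in
mm/shuffle.c below; the 0400 permission makes it read-only at runtime):

    page_alloc.shuffle=1     # force-enable, even without a detected cache
    page_alloc.shuffle=0     # force-disable, even with a detected cache

    # read back the effective state
    cat /sys/module/page_alloc/parameters/shuffle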

Outside of memory-side-cache utilization concerns there is a potential
security benefit from randomization. Some data exfiltration and
return-oriented-programming attacks rely on the ability to infer the
location of sensitive data objects. The kernel page allocator,
especially early in system boot, has predictable first-in-first-out
behavior for physical pages. Pages are freed in physical address order
when first onlined.

Quoting Kees:
    "While we already have a base-address randomization
     (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
     memory layouts would certainly be using the predictability of
     allocation ordering (i.e. for attacks where the base address isn't
     important: only the relative positions between allocated memory).
     This is common in lots of heap-style attacks. They try to gain
     control over ordering by spraying allocations, etc.

     I'd really like to see this because it gives us something similar
     to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

While SLAB_FREELIST_RANDOM reduces the predictability of some local
slab caches, it leaves the vast bulk of memory to be allocated in a
predictable order. However, it should be noted that the concrete
security benefits are hard to quantify, and no known CVE is mitigated
by this randomization.

Introduce shuffle_free_memory(), and its helper shuffle_zone(), to
perform a Fisher-Yates shuffle of the page allocator 'free_area' lists
when they are initially populated with free memory at boot and at
hotplug time. Do this based on either the presence of a
page_alloc.shuffle=Y command line parameter, or autodetection of a
memory-side-cache (to be added in a follow-on patch).

The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
pages, where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1,
i.e. order 10 (4MB with 4KB pages). This trades off randomization
granularity for time spent shuffling. MAX_ORDER-1 was chosen to be
minimally invasive to the page allocator while still showing
memory-side-cache behavior improvements, with the expectation that the
security implications of finer granularity randomization are mitigated
by CONFIG_SLAB_FREELIST_RANDOM.
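
For reference, the resulting shuffle granularity works out as:

    shuffle unit = PAGE_SIZE << SHUFFLE_ORDER = 4KB << 10 = 4MB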

The performance impact of the shuffling appears to be in the noise
compared to other memory initialization work. Also the bulk of the work
is done in the background as a part of deferred_init_memmap().

This initial randomization can be undone over time, so a follow-on
patch is introduced to inject entropy on page free decisions. It is
reasonable to ask whether the page free entropy alone would be
sufficient; it is not, due to the in-order initial freeing of pages. At
the start of that process, putting page1 in front of or behind page0
still keeps them close together; page2 is still near page1 and has a
high chance of being adjacent. As more pages are added, ordering
diversity improves, but there is still high page locality for the low
address pages, and this leads to no significant impact on the cache
conflict rate.
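
That argument can be checked with a small userspace sketch (not kernel
code): "free" pages 0..N-1 in order, flip a coin to place each at the
head or tail of the list, and count how many list-neighbors remain
physically adjacent. Roughly half do, versus ~2/N for a full shuffle:

    #include <stdio.h>
    #include <stdlib.h>

    #define N 1024

    int main(void)
    {
            static int list[2 * N];
            int head = N, tail = N; /* occupied range: list[head..tail) */
            int adjacent = 0;
            int i;

            srand(1);
            /* in-order freeing with a random front/back decision per page */
            for (i = 0; i < N; i++) {
                    if (rand() & 1)
                            list[--head] = i;
                    else
                            list[tail++] = i;
            }
            /* count list-neighbors that are still physically adjacent */
            for (i = head; i < tail - 1; i++)
                    if (abs(list[i] - list[i + 1]) == 1)
                            adjacent++;
            printf("%d of %d neighbor pairs still adjacent\n", adjacent, N - 1);
            return 0;
    }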

[1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
[2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
[3]: https://lkml.org/lkml/2018/10/12/309

Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/list.h    |   17 ++++
 include/linux/mmzone.h  |    4 +
 include/linux/shuffle.h |   45 +++++++++++
 init/Kconfig            |   23 ++++++
 mm/Makefile             |    7 ++
 mm/memblock.c           |    1 
 mm/memory_hotplug.c     |    3 +
 mm/page_alloc.c         |    6 +-
 mm/shuffle.c            |  188 +++++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 292 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/shuffle.h
 create mode 100644 mm/shuffle.c

Comments

Mike Rapoport Jan. 30, 2019, 6:48 a.m. UTC | #1
On Tue, Jan 29, 2019 at 09:02:16PM -0800, Dan Williams wrote:
> Randomization of the page allocator improves the average utilization of
> a direct-mapped memory-side-cache. Memory side caching is a platform
> capability that Linux has previously been exposed to in HPC
> (high-performance computing) environments on specialty platforms. In
> that instance it was a smaller pool of high-bandwidth memory relative to
> higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
> be found on general purpose server platforms where DRAM is a cache in
> front of higher latency persistent memory [1].

[ ... ]
 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  include/linux/list.h    |   17 ++++
>  include/linux/mmzone.h  |    4 +
>  include/linux/shuffle.h |   45 +++++++++++
>  init/Kconfig            |   23 ++++++
>  mm/Makefile             |    7 ++
>  mm/memblock.c           |    1 
>  mm/memory_hotplug.c     |    3 +
>  mm/page_alloc.c         |    6 +-
>  mm/shuffle.c            |  188 +++++++++++++++++++++++++++++++++++++++++++++++
>  9 files changed, 292 insertions(+), 2 deletions(-)
>  create mode 100644 include/linux/shuffle.h
>  create mode 100644 mm/shuffle.c

...

> diff --git a/mm/memblock.c b/mm/memblock.c
> index 022d4cbb3618..c0cfbfae4a03 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -17,6 +17,7 @@
>  #include <linux/poison.h>
>  #include <linux/pfn.h>
>  #include <linux/debugfs.h>
> +#include <linux/shuffle.h>

Nit: does not seem to be required

>  #include <linux/kmemleak.h>
>  #include <linux/seq_file.h>
>  #include <linux/memblock.h>
Dan Williams Jan. 30, 2019, 6:30 p.m. UTC | #2
On Tue, Jan 29, 2019 at 10:49 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Tue, Jan 29, 2019 at 09:02:16PM -0800, Dan Williams wrote:
> > [ ... ]
>
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 022d4cbb3618..c0cfbfae4a03 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -17,6 +17,7 @@
> >  #include <linux/poison.h>
> >  #include <linux/pfn.h>
> >  #include <linux/debugfs.h>
> > +#include <linux/shuffle.h>
>
> Nit: does not seem to be required
>
> >  #include <linux/kmemleak.h>
> >  #include <linux/seq_file.h>
> >  #include <linux/memblock.h>

Good catch. Notch one more line saved in the incremental diffstat from
v8. I'll wait for Michal's thumbs up on the rest before re-spinning,
or perhaps Andrew can drop this line on applying?
Michal Hocko Jan. 30, 2019, 7:08 p.m. UTC | #3
On Tue 29-01-19 21:02:16, Dan Williams wrote:
> [ ... ]
> 
> The performance impact of the shuffling appears to be in the noise
> compared to other memory initialization work. Also the bulk of the work
> is done in the background as a part of deferred_init_memmap().

The last part is not true with this version anymore, right?

> This initial randomization can be undone over time, so a follow-on
> patch is introduced to inject entropy on page free decisions. It is
> reasonable to ask whether the page free entropy alone would be
> sufficient; it is not, due to the in-order initial freeing of pages. At
> the start of that process, putting page1 in front of or behind page0
> still keeps them close together; page2 is still near page1 and has a
> high chance of being adjacent. As more pages are added, ordering
> diversity improves, but there is still high page locality for the low
> address pages, and this leads to no significant impact on the cache
> conflict rate.

I find mm_shuffle_ctl a bit confusing because the mode of operation is
either AUTO (enabled when the HW is present) or FORCE_ENABLE when
explicitly enabled by the command line. Nothing earth shattering though.

> [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
> [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
> [3]: https://lkml.org/lkml/2018/10/12/309
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Other than that, I haven't spotted any fundamental issues. The feature
is a hack but I do agree that it might be useful for the specific HW it
is going to be used for. I still think that shuffling only top orders
has close to zero security benefits because it is not that hard to
control the memory fragmentation.

With that
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/list.h    |   17 ++++
>  include/linux/mmzone.h  |    4 +
>  include/linux/shuffle.h |   45 +++++++++++
>  init/Kconfig            |   23 ++++++
>  mm/Makefile             |    7 ++
>  mm/memblock.c           |    1 
>  mm/memory_hotplug.c     |    3 +
>  mm/page_alloc.c         |    6 +-
>  mm/shuffle.c            |  188 +++++++++++++++++++++++++++++++++++++++++++++++
>  9 files changed, 292 insertions(+), 2 deletions(-)
>  create mode 100644 include/linux/shuffle.h
>  create mode 100644 mm/shuffle.c
> 
> diff --git a/include/linux/list.h b/include/linux/list.h
> index edb7628e46ed..3dfb8953f241 100644
> --- a/include/linux/list.h
> +++ b/include/linux/list.h
> @@ -150,6 +150,23 @@ static inline void list_replace_init(struct list_head *old,
>  	INIT_LIST_HEAD(old);
>  }
>  
> +/**
> + * list_swap - replace entry1 with entry2 and re-add entry1 at entry2's position
> + * @entry1: the location to place entry2
> + * @entry2: the location to place entry1
> + */
> +static inline void list_swap(struct list_head *entry1,
> +			     struct list_head *entry2)
> +{
> +	struct list_head *pos = entry2->prev;
> +
> +	list_del(entry2);
> +	list_replace(entry1, entry2);
> +	if (pos == entry1)
> +		pos = entry2;
> +	list_add(entry1, pos);
> +}
> +
>  /**
>   * list_del_init - deletes entry from list and reinitialize it.
>   * @entry: the element to delete from the list.
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index cc4a507d7ca4..374e9d483382 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1272,6 +1272,10 @@ void sparse_init(void);
>  #else
>  #define sparse_init()	do {} while (0)
>  #define sparse_index_init(_sec, _nid)  do {} while (0)
> +static inline int pfn_present(unsigned long pfn)
> +{
> +	return pfn_valid(pfn);
> +}
>  #endif /* CONFIG_SPARSEMEM */
>  
>  /*
> diff --git a/include/linux/shuffle.h b/include/linux/shuffle.h
> new file mode 100644
> index 000000000000..bed2d2901d13
> --- /dev/null
> +++ b/include/linux/shuffle.h
> @@ -0,0 +1,45 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2018 Intel Corporation. All rights reserved.
> +#ifndef _MM_SHUFFLE_H
> +#define _MM_SHUFFLE_H
> +#include <linux/jump_label.h>
> +
> +enum mm_shuffle_ctl {
> +	SHUFFLE_ENABLE,
> +	SHUFFLE_FORCE_DISABLE,
> +};
> +
> +#define SHUFFLE_ORDER (MAX_ORDER-1)
> +
> +#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
> +DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
> +extern void page_alloc_shuffle(enum mm_shuffle_ctl ctl);
> +extern void __shuffle_free_memory(pg_data_t *pgdat);
> +static inline void shuffle_free_memory(pg_data_t *pgdat)
> +{
> +	if (!static_branch_unlikely(&page_alloc_shuffle_key))
> +		return;
> +	__shuffle_free_memory(pgdat);
> +}
> +
> +extern void __shuffle_zone(struct zone *z);
> +static inline void shuffle_zone(struct zone *z)
> +{
> +	if (!static_branch_unlikely(&page_alloc_shuffle_key))
> +		return;
> +	__shuffle_zone(z);
> +}
> +#else
> +static inline void shuffle_free_memory(pg_data_t *pgdat)
> +{
> +}
> +
> +static inline void shuffle_zone(struct zone *z)
> +{
> +}
> +
> +static inline void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
> +{
> +}
> +#endif
> +#endif /* _MM_SHUFFLE_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index d47cb77a220e..cfa199f3e9be 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1714,6 +1714,29 @@ config SLAB_FREELIST_HARDENED
>  	  sacrifies to harden the kernel slab allocator against common
>  	  freelist exploit methods.
>  
> +config SHUFFLE_PAGE_ALLOCATOR
> +	bool "Page allocator randomization"
> +	default SLAB_FREELIST_RANDOM && ACPI_NUMA
> +	help
> +	  Randomization of the page allocator improves the average
> +	  utilization of a direct-mapped memory-side-cache. See section
> +	  5.2.27 Heterogeneous Memory Attribute Table (HMAT) in the ACPI
> +	  6.2a specification for an example of how a platform advertises
> +	  the presence of a memory-side-cache. There are also incidental
> +	  security benefits as it reduces the predictability of page
> +	  allocations to compliment SLAB_FREELIST_RANDOM, but the
> +	  default granularity of shuffling on 4MB (MAX_ORDER) pages is
> +	  selected based on cache utilization benefits.
> +
> +	  While the randomization improves cache utilization it may
> +	  negatively impact workloads on platforms without a cache. For
> +	  this reason, by default, the randomization is enabled only
> +	  after runtime detection of a direct-mapped memory-side-cache.
> +	  Otherwise, the randomization may be force enabled with the
> +	  'page_alloc.shuffle' kernel command line parameter.
> +
> +	  Say Y if unsure.
> +
>  config SLUB_CPU_PARTIAL
>  	default y
>  	depends on SLUB && SMP
> diff --git a/mm/Makefile b/mm/Makefile
> index d210cc9d6f80..ac5e5ba78874 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -33,7 +33,7 @@ mmu-$(CONFIG_MMU)	+= process_vm_access.o
>  endif
>  
>  obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
> -			   maccess.o page_alloc.o page-writeback.o \
> +			   maccess.o page-writeback.o \
>  			   readahead.o swap.o truncate.o vmscan.o shmem.o \
>  			   util.o mmzone.o vmstat.o backing-dev.o \
>  			   mm_init.o mmu_context.o percpu.o slab_common.o \
> @@ -41,6 +41,11 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
>  			   interval_tree.o list_lru.o workingset.o \
>  			   debug.o $(mmu-y)
>  
> +# Give 'page_alloc' its own module-parameter namespace
> +page-alloc-y := page_alloc.o
> +page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
> +
> +obj-y += page-alloc.o
>  obj-y += init-mm.o
>  obj-y += memblock.o
>  
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 022d4cbb3618..c0cfbfae4a03 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -17,6 +17,7 @@
>  #include <linux/poison.h>
>  #include <linux/pfn.h>
>  #include <linux/debugfs.h>
> +#include <linux/shuffle.h>
>  #include <linux/kmemleak.h>
>  #include <linux/seq_file.h>
>  #include <linux/memblock.h>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index b9a667d36c55..07732be3065e 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -23,6 +23,7 @@
>  #include <linux/highmem.h>
>  #include <linux/vmalloc.h>
>  #include <linux/ioport.h>
> +#include <linux/shuffle.h>
>  #include <linux/delay.h>
>  #include <linux/migrate.h>
>  #include <linux/page-isolation.h>
> @@ -895,6 +896,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
>  	zone->zone_pgdat->node_present_pages += onlined_pages;
>  	pgdat_resize_unlock(zone->zone_pgdat, &flags);
>  
> +	shuffle_zone(zone);
> +
>  	if (onlined_pages) {
>  		node_states_set_node(nid, &arg);
>  		if (need_zonelists_rebuild)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index cde5dac6229a..6208ff744b07 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -61,6 +61,7 @@
>  #include <linux/sched/rt.h>
>  #include <linux/sched/mm.h>
>  #include <linux/page_owner.h>
> +#include <linux/shuffle.h>
>  #include <linux/kthread.h>
>  #include <linux/memcontrol.h>
>  #include <linux/ftrace.h>
> @@ -1752,9 +1753,9 @@ _deferred_grow_zone(struct zone *zone, unsigned int order)
>  void __init page_alloc_init_late(void)
>  {
>  	struct zone *zone;
> +	int nid;
>  
>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> -	int nid;
>  
>  	/* There will be num_node_state(N_MEMORY) threads */
>  	atomic_set(&pgdat_init_n_undone, num_node_state(N_MEMORY));
> @@ -1779,6 +1780,9 @@ void __init page_alloc_init_late(void)
>  	memblock_discard();
>  #endif
>  
> +	for_each_node_state(nid, N_MEMORY)
> +		shuffle_free_memory(NODE_DATA(nid));
> +
>  	for_each_populated_zone(zone)
>  		set_zone_contiguous(zone);
>  }
> diff --git a/mm/shuffle.c b/mm/shuffle.c
> new file mode 100644
> index 000000000000..db517cdbaebe
> --- /dev/null
> +++ b/mm/shuffle.c
> @@ -0,0 +1,188 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2018 Intel Corporation. All rights reserved.
> +
> +#include <linux/mm.h>
> +#include <linux/init.h>
> +#include <linux/mmzone.h>
> +#include <linux/random.h>
> +#include <linux/shuffle.h>
> +#include <linux/moduleparam.h>
> +#include "internal.h"
> +
> +DEFINE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
> +static unsigned long shuffle_state __ro_after_init;
> +
> +/*
> + * Depending on the architecture, module parameter parsing may run
> + * before, or after the cache detection. SHUFFLE_FORCE_DISABLE prevents,
> + * or reverts the enabling of the shuffle implementation. SHUFFLE_ENABLE
> + * attempts to turn on the implementation, but aborts if it finds
> + * SHUFFLE_FORCE_DISABLE already set.
> + */
> +void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
> +{
> +	if (ctl == SHUFFLE_FORCE_DISABLE)
> +		set_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state);
> +
> +	if (test_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state)) {
> +		if (test_and_clear_bit(SHUFFLE_ENABLE, &shuffle_state))
> +			static_branch_disable(&page_alloc_shuffle_key);
> +	} else if (ctl == SHUFFLE_ENABLE
> +			&& !test_and_set_bit(SHUFFLE_ENABLE, &shuffle_state))
> +		static_branch_enable(&page_alloc_shuffle_key);
> +}
> +
> +static bool shuffle_param;
> +extern int shuffle_show(char *buffer, const struct kernel_param *kp)
> +{
> +	return sprintf(buffer, "%c\n", test_bit(SHUFFLE_ENABLE, &shuffle_state)
> +			? 'Y' : 'N');
> +}
> +static int shuffle_store(const char *val, const struct kernel_param *kp)
> +{
> +	int rc = param_set_bool(val, kp);
> +
> +	if (rc < 0)
> +		return rc;
> +	if (shuffle_param)
> +		page_alloc_shuffle(SHUFFLE_ENABLE);
> +	else
> +		page_alloc_shuffle(SHUFFLE_FORCE_DISABLE);
> +	return 0;
> +}
> +module_param_call(shuffle, shuffle_store, shuffle_show, &shuffle_param, 0400);
> +
> +/*
> + * For two pages to be swapped in the shuffle, they must be free (on a
> + * 'free_area' lru), have the same order, and have the same migratetype.
> + */
> +static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
> +{
> +	struct page *page;
> +
> +	/*
> +	 * Given we're dealing with randomly selected pfns in a zone we
> +	 * need to ask questions like...
> +	 */
> +
> +	/* ...is the pfn even in the memmap? */
> +	if (!pfn_valid_within(pfn))
> +		return NULL;
> +
> +	/* ...is the pfn in a present section or a hole? */
> +	if (!pfn_present(pfn))
> +		return NULL;
> +
> +	/* ...is the page free and currently on a free_area list? */
> +	page = pfn_to_page(pfn);
> +	if (!PageBuddy(page))
> +		return NULL;
> +
> +	/*
> +	 * ...is the page on the same list as the page we will
> +	 * shuffle it with?
> +	 */
> +	if (page_order(page) != order)
> +		return NULL;
> +
> +	return page;
> +}
> +
> +/*
> + * Fisher-Yates shuffle the freelist which prescribes iterating through
> + * an array, pfns in this case, and randomly swapping each entry with
> + * another in the span, end_pfn - start_pfn.
> + *
> + * To keep the implementation simple it does not attempt to correct for
> + * sources of bias in the distribution, like modulo bias or
> + * pseudo-random number generator bias. I.e. the expectation is that
> + * this shuffling raises the bar for attacks that exploit the
> + * predictability of page allocations, but need not be a perfect
> + * shuffle.
> + */
> +#define SHUFFLE_RETRY 10
> +void __meminit __shuffle_zone(struct zone *z)
> +{
> +	unsigned long i, flags;
> +	unsigned long start_pfn = z->zone_start_pfn;
> +	unsigned long end_pfn = zone_end_pfn(z);
> +	const int order = SHUFFLE_ORDER;
> +	const int order_pages = 1 << order;
> +
> +	spin_lock_irqsave(&z->lock, flags);
> +	start_pfn = ALIGN(start_pfn, order_pages);
> +	for (i = start_pfn; i < end_pfn; i += order_pages) {
> +		unsigned long j;
> +		int migratetype, retry;
> +		struct page *page_i, *page_j;
> +
> +		/*
> +		 * We expect page_i, in the sub-range of a zone being
> +		 * added (@start_pfn to @end_pfn), to more likely be
> +		 * valid compared to page_j randomly selected in the
> +		 * span @zone_start_pfn to @spanned_pages.
> +		 */
> +		page_i = shuffle_valid_page(i, order);
> +		if (!page_i)
> +			continue;
> +
> +		for (retry = 0; retry < SHUFFLE_RETRY; retry++) {
> +			/*
> +			 * Pick a random order aligned page from the
> +			 * start of the zone. Use the *whole* zone here
> +			 * so that if it is freed in tiny pieces that we
> +			 * randomize in the whole zone, not just within
> +			 * those fragments.
> +			 *
> +			 * Since page_j comes from a potentially sparse
> +			 * address range we want to try a bit harder to
> +			 * find a shuffle point for page_i.
> +			 */
> +			j = z->zone_start_pfn +
> +				ALIGN_DOWN(get_random_long() % z->spanned_pages,
> +						order_pages);
> +			page_j = shuffle_valid_page(j, order);
> +			if (page_j && page_j != page_i)
> +				break;
> +		}
> +		if (retry >= SHUFFLE_RETRY) {
> +			pr_debug("%s: failed to swap %#lx\n", __func__, i);
> +			continue;
> +		}
> +
> +		/*
> +		 * Each migratetype corresponds to its own list, make
> +		 * sure the types match otherwise we're moving pages to
> +		 * lists where they do not belong.
> +		 */
> +		migratetype = get_pageblock_migratetype(page_i);
> +		if (get_pageblock_migratetype(page_j) != migratetype) {
> +			pr_debug("%s: migratetype mismatch %#lx\n", __func__, i);
> +			continue;
> +		}
> +
> +		list_swap(&page_i->lru, &page_j->lru);
> +
> +		pr_debug("%s: swap: %#lx -> %#lx\n", __func__, i, j);
> +
> +		/* take it easy on the zone lock */
> +		if ((i % (100 * order_pages)) == 0) {
> +			spin_unlock_irqrestore(&z->lock, flags);
> +			cond_resched();
> +			spin_lock_irqsave(&z->lock, flags);
> +		}
> +	}
> +	spin_unlock_irqrestore(&z->lock, flags);
> +}
> +
> +/**
> + * shuffle_free_memory - reduce the predictability of the page allocator
> + * @pgdat: node page data
> + */
> +void __meminit __shuffle_free_memory(pg_data_t *pgdat)
> +{
> +	struct zone *z;
> +
> +	for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
> +		shuffle_zone(z);
> +}
>
Dan Williams Jan. 31, 2019, 1:33 a.m. UTC | #4
On Wed, Jan 30, 2019 at 11:08 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 29-01-19 21:02:16, Dan Williams wrote:
> > [ ... ]
> >
> > The performance impact of the shuffling appears to be in the noise
> > compared to other memory initialization work. Also the bulk of the work
> > is done in the background as a part of deferred_init_memmap().
>
> The last part is not true with this version anymore, right?

True, and given that page_alloc_init_late() is waiting for it to
complete, the impact is no different from v8 to v9. I'll drop that
sentence from the changelog.

>
> > This initial randomization can be undone over time, so a follow-on
> > patch is introduced to inject entropy on page free decisions. It is
> > reasonable to ask whether the page free entropy alone would be
> > sufficient; it is not, due to the in-order initial freeing of pages. At
> > the start of that process, putting page1 in front of or behind page0
> > still keeps them close together; page2 is still near page1 and has a
> > high chance of being adjacent. As more pages are added, ordering
> > diversity improves, but there is still high page locality for the low
> > address pages, and this leads to no significant impact on the cache
> > conflict rate.
>
> I find mm_shuffle_ctl a bit confusing because the mode of operation is
> either AUTO (enabled when the HW is present) or FORCE_ENABLE when
> explicitly enabled by the command line. Nothing earth shattering though.

Yeah, it's named from the perspective of the kernel internal usage
which is flipped from the user facing interaction. ENABLE is called
from the command line handler and in a follow-on patch the parser of
the platform-firmware table indicating the presence of a cache.
FORCE_DISABLE is only called from the command line handler. I'll add a
comment to this effect.
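
For reference, the behavior described above maps out as follows (derived
from page_alloc_shuffle() in the patch, with the cache-detection caller
arriving in the follow-on patch):

    caller                       ctl                    effect
    page_alloc.shuffle=1         SHUFFLE_ENABLE         on, unless force-disabled
    memory-side-cache detection  SHUFFLE_ENABLE         on, unless force-disabled
    page_alloc.shuffle=0         SHUFFLE_FORCE_DISABLE  off, and latched off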

>
> > [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
> > [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
> > [3]: https://lkml.org/lkml/2018/10/12/309
> >
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > Cc: Mike Rapoport <rppt@linux.ibm.com>
> > Reviewed-by: Kees Cook <keescook@chromium.org>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>
> Other than that, I haven't spotted any fundamental issues. The feature
> is a hack but I do agree that it might be useful for the specific HW it
> is going to be used for. I still think that shuffling only top orders
> has close to zero security benefits because it is not that hard to
> control the memory fragmentation.
>
> With that
> Acked-by: Michal Hocko <mhocko@suse.com>

Much appreciated.
Andrew Morton Jan. 31, 2019, 10:14 p.m. UTC | #5
On Tue, 29 Jan 2019 21:02:16 -0800 Dan Williams <dan.j.williams@intel.com> wrote:

> [ ... ]
> 
> Introduce shuffle_free_memory(), and its helper shuffle_zone(), to
> perform a Fisher-Yates shuffle of the page allocator 'free_area' lists
> when they are initially populated with free memory at boot and at
> hotplug time. Do this based on either the presence of a
> page_alloc.shuffle=Y command line parameter, or autodetection of a
> memory-side-cache (to be added in a follow-on patch).

This is unfortunate from a testing and coverage point of view.  At
least initially it is desirable that all testers run this feature.

Also, it's unfortunate that enabling the feature requires a reboot.
What happens if we do away with the boot-time (and maybe hotplug-time)
randomization and permit the feature to be switched on/off at runtime?

> [ ... ]
>
>  include/linux/list.h    |   17 ++++
>  include/linux/mmzone.h  |    4 +
>  include/linux/shuffle.h |   45 +++++++++++
>  init/Kconfig            |   23 ++++++
>  mm/Makefile             |    7 ++
>  mm/memblock.c           |    1 
>  mm/memory_hotplug.c     |    3 +
>  mm/page_alloc.c         |    6 +-
>  mm/shuffle.c            |  188 +++++++++++++++++++++++++++++++++++++++++++++++

Can we get a Documentation update for the new kernel parameter?

> 
> ...
>
> --- /dev/null
> +++ b/mm/shuffle.c
> @@ -0,0 +1,188 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2018 Intel Corporation. All rights reserved.
> +
> +#include <linux/mm.h>
> +#include <linux/init.h>
> +#include <linux/mmzone.h>
> +#include <linux/random.h>
> +#include <linux/shuffle.h>

Does shuffle.h need to be available to the whole kernel or can we put
it in mm/?

> +#include <linux/moduleparam.h>
> +#include "internal.h"
> +
> +DEFINE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
> +static unsigned long shuffle_state __ro_after_init;
> +
> +/*
> + * Depending on the architecture, module parameter parsing may run
> + * before, or after the cache detection. SHUFFLE_FORCE_DISABLE prevents,
> + * or reverts the enabling of the shuffle implementation. SHUFFLE_ENABLE
> + * attempts to turn on the implementation, but aborts if it finds
> + * SHUFFLE_FORCE_DISABLE already set.
> + */
> +void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
> +{
> +	if (ctl == SHUFFLE_FORCE_DISABLE)
> +		set_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state);
> +
> +	if (test_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state)) {
> +		if (test_and_clear_bit(SHUFFLE_ENABLE, &shuffle_state))
> +			static_branch_disable(&page_alloc_shuffle_key);
> +	} else if (ctl == SHUFFLE_ENABLE
> +			&& !test_and_set_bit(SHUFFLE_ENABLE, &shuffle_state))
> +		static_branch_enable(&page_alloc_shuffle_key);
> +}

Can this be __meminit?

> +static bool shuffle_param;
> +extern int shuffle_show(char *buffer, const struct kernel_param *kp)
> +{
> +	return sprintf(buffer, "%c\n", test_bit(SHUFFLE_ENABLE, &shuffle_state)
> +			? 'Y' : 'N');
> +}
> +static int shuffle_store(const char *val, const struct kernel_param *kp)
> +{
> +	int rc = param_set_bool(val, kp);
> +
> +	if (rc < 0)
> +		return rc;
> +	if (shuffle_param)
> +		page_alloc_shuffle(SHUFFLE_ENABLE);
> +	else
> +		page_alloc_shuffle(SHUFFLE_FORCE_DISABLE);
> +	return 0;
> +}
> +module_param_call(shuffle, shuffle_store, shuffle_show, &shuffle_param, 0400);
> 
> ...
>
> +/*
> + * Fisher-Yates shuffle the freelist which prescribes iterating through
> + * an array, pfns in this case, and randomly swapping each entry with
> + * another in the span, end_pfn - start_pfn.
> + *
> + * To keep the implementation simple it does not attempt to correct for
> + * sources of bias in the distribution, like modulo bias or
> + * pseudo-random number generator bias. I.e. the expectation is that
> + * this shuffling raises the bar for attacks that exploit the
> + * predictability of page allocations, but need not be a perfect
> + * shuffle.

Reflowing the comment to use all 80 cols would save a line :)

> + */
> +#define SHUFFLE_RETRY 10
> +void __meminit __shuffle_zone(struct zone *z)
> +{
> +	unsigned long i, flags;
> +	unsigned long start_pfn = z->zone_start_pfn;
> +	unsigned long end_pfn = zone_end_pfn(z);
> +	const int order = SHUFFLE_ORDER;
> +	const int order_pages = 1 << order;
> +
> +	spin_lock_irqsave(&z->lock, flags);
> +	start_pfn = ALIGN(start_pfn, order_pages);
> +	for (i = start_pfn; i < end_pfn; i += order_pages) {
> +		unsigned long j;
> +		int migratetype, retry;
> +		struct page *page_i, *page_j;
> +
> +		/*
> +		 * We expect page_i, in the sub-range of a zone being
> +		 * added (@start_pfn to @end_pfn), to more likely be
> +		 * valid compared to page_j randomly selected in the
> +		 * span @zone_start_pfn to @spanned_pages.
> +		 */
> +		page_i = shuffle_valid_page(i, order);
> +		if (!page_i)
> +			continue;
> +
> +		for (retry = 0; retry < SHUFFLE_RETRY; retry++) {
> +			/*
> +			 * Pick a random order aligned page from the
> +			 * start of the zone. Use the *whole* zone here
> +			 * so that if it is freed in tiny pieces that we
> +			 * randomize in the whole zone, not just within
> +			 * those fragments.

Second sentence is hard to parse.

> +			 *
> +			 * Since page_j comes from a potentially sparse
> +			 * address range we want to try a bit harder to
> +			 * find a shuffle point for page_i.
> +			 */

Reflow the comment...

> +			j = z->zone_start_pfn +
> +				ALIGN_DOWN(get_random_long() % z->spanned_pages,
> +						order_pages);
> +			page_j = shuffle_valid_page(j, order);
> +			if (page_j && page_j != page_i)
> +				break;
> +		}
> +		if (retry >= SHUFFLE_RETRY) {
> +			pr_debug("%s: failed to swap %#lx\n", __func__, i);
> +			continue;
> +		}
> +
> +		/*
> +		 * Each migratetype corresponds to its own list, make
> +		 * sure the types match otherwise we're moving pages to
> +		 * lists where they do not belong.
> +		 */

Reflow.

> +		migratetype = get_pageblock_migratetype(page_i);
> +		if (get_pageblock_migratetype(page_j) != migratetype) {
> +			pr_debug("%s: migratetype mismatch %#lx\n", __func__, i);
> +			continue;
> +		}
> +
> +		list_swap(&page_i->lru, &page_j->lru);
> +
> +		pr_debug("%s: swap: %#lx -> %#lx\n", __func__, i, j);
> +
> +		/* take it easy on the zone lock */
> +		if ((i % (100 * order_pages)) == 0) {
> +			spin_unlock_irqrestore(&z->lock, flags);
> +			cond_resched();
> +			spin_lock_irqsave(&z->lock, flags);
> +		}
> +	}
> +	spin_unlock_irqrestore(&z->lock, flags);
> +}
> 
> ...
>
Dan Williams Jan. 31, 2019, 11:04 p.m. UTC | #6
On Thu, Jan 31, 2019 at 2:15 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> On Tue, 29 Jan 2019 21:02:16 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
[..]
> > Introduce shuffle_free_memory(), and its helper shuffle_zone(), to
> > perform a Fisher-Yates shuffle of the page allocator 'free_area' lists
> > when they are initially populated with free memory at boot and at
> > hotplug time. Do this based on either the presence of a
> > page_alloc.shuffle=Y command line parameter, or autodetection of a
> > memory-side-cache (to be added in a follow-on patch).
>
> This is unfortunate from a testing and coverage point of view.  At
> least initially it is desirable that all testers run this feature.
>
> Also, it's unfortunate that enabling the feature requires a reboot.
> What happens if we do away with the boot-time (and maybe hotplug-time)
> randomization and permit the feature to be switched on/off at runtime?

Currently there's the 'shuffle' at memory online time and a random
front-back freeing of MAX_ORDER pages to the free lists at runtime.
The random front-back freeing would be trivial to toggle at runtime;
however, testing showed that the entropy it injects is only enough to
preserve the randomization of the initial 'shuffle', not enough to
improve cache utilization on its own.
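
For illustration, that runtime front-back freeing is essentially a
coin flip per high-order free. A minimal sketch of the idea (names and
details may differ from the actual follow-on patch; assumes the usual
mm includes):

	/* Randomly place a newly freed high-order page at the head or
	 * tail of its free list, preserving boot-time shuffle entropy.
	 * A 64-bit pool amortizes the get_random_u64() calls. */
	static void add_to_free_area_random(struct page *page,
			struct free_area *area, int migratetype)
	{
		static u64 rand_pool;
		static u8 rand_bits;

		if (rand_bits == 0) {	/* refill the random-bit pool */
			rand_bits = 64;
			rand_pool = get_random_u64();
		}
		if (rand_pool & 1)
			list_add(&page->lru, &area->free_list[migratetype]);
		else
			list_add_tail(&page->lru, &area->free_list[migratetype]);
		rand_bits--;
		rand_pool >>= 1;
	}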

The shuffling could be done dynamically at runtime, but since it only
shuffles free memory its effectiveness is diminished if the workload
has already taken pages off the free list. It's also diminished if the
free lists are polluted with sub-MAX_ORDER pages.

The number of caveats that need to be documented makes me skeptical
that runtime-triggered shuffling would be reliable.

That said, I see your point about experimentation and validation. What
about making it settable as a sysfs parameter for memory-blocks that
are being hot-added? That way we know the shuffle will be effective and
the administrator can validate shuffling with a hot-unplug/replug?

> > The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
> > pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e.
> > order 10 (4MB); this trades off randomization granularity for time
> > spent shuffling. MAX_ORDER-1 was chosen to be minimally invasive to
> > the page allocator while still showing memory-side cache behavior
> > improvements, with the expectation that the security implications of
> > finer-granularity randomization are mitigated by
> > CONFIG_SLAB_FREELIST_RANDOM.
> >
> > The performance impact of the shuffling appears to be in the noise
> > compared to other memory initialization work. Also the bulk of the work
> > is done in the background as a part of deferred_init_memmap().
> >
> > This initial randomization can be undone over time, so a follow-on patch
> > is introduced to inject entropy on page free decisions. It is reasonable
> > to ask if the page-free entropy alone would be sufficient, but it is not,
> > due to the in-order initial freeing of pages. At the start of that
> > process, putting page1 in front of or behind page0 still keeps them
> > close together; page2 is still near page1 and has a high chance of being
> > adjacent. As more pages are added, ordering diversity improves, but
> > there is still high page locality for the low-address pages, and this
> > leads to no significant impact on the cache conflict rate.
> >
> > ...
> >
> >  include/linux/list.h    |   17 ++++
> >  include/linux/mmzone.h  |    4 +
> >  include/linux/shuffle.h |   45 +++++++++++
> >  init/Kconfig            |   23 ++++++
> >  mm/Makefile             |    7 ++
> >  mm/memblock.c           |    1
> >  mm/memory_hotplug.c     |    3 +
> >  mm/page_alloc.c         |    6 +-
> >  mm/shuffle.c            |  188 +++++++++++++++++++++++++++++++++++++++++++++++
>
> Can we get a Documentation update for the new kernel parameter?

Yes.
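
For example, something along these lines (a sketch, not final wording)
for Documentation/admin-guide/kernel-parameters.txt:

	page_alloc.shuffle=
			[KNL] Boolean flag to control whether the page
			allocator should randomize its free lists. The
			randomization may be automatically enabled if the
			kernel detects it is running on a platform with a
			direct-mapped memory-side cache.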

>
> >
> > ...
> >
> > --- /dev/null
> > +++ b/mm/shuffle.c
> > @@ -0,0 +1,188 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +// Copyright(c) 2018 Intel Corporation. All rights reserved.
> > +
> > +#include <linux/mm.h>
> > +#include <linux/init.h>
> > +#include <linux/mmzone.h>
> > +#include <linux/random.h>
> > +#include <linux/shuffle.h>
>
> Does shuffle.h need to be available to the whole kernel or can we put
> it in mm/?

The wider kernel just needs page_alloc_shuffle() so that
platform-firmware parsing code that detects a memory-side-cache can
enable the shuffle. The rest can be constrained to an mm/ local
header.
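
Concretely, the split would look something like this (a sketch; only
the enable hook stays kernel-wide):

	/* include/linux/shuffle.h -- visible to platform/firmware code */
	enum mm_shuffle_ctl {
		SHUFFLE_ENABLE,
		SHUFFLE_FORCE_DISABLE,
	};
	void page_alloc_shuffle(enum mm_shuffle_ctl ctl);

	/* mm/shuffle.h -- mm-local: SHUFFLE_ORDER, the static key, and
	 * the shuffle_zone()/shuffle_free_memory() wrappers move here. */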

>
> > +#include <linux/moduleparam.h>
> > +#include "internal.h"
> > +
> > +DEFINE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
> > +static unsigned long shuffle_state __ro_after_init;
> > +
> > +/*
> > + * Depending on the architecture, module parameter parsing may run
> > + * before, or after the cache detection. SHUFFLE_FORCE_DISABLE prevents,
> > + * or reverts the enabling of the shuffle implementation. SHUFFLE_ENABLE
> > + * attempts to turn on the implementation, but aborts if it finds
> > + * SHUFFLE_FORCE_DISABLE already set.
> > + */
> > +void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
> > +{
> > +     if (ctl == SHUFFLE_FORCE_DISABLE)
> > +             set_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state);
> > +
> > +     if (test_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state)) {
> > +             if (test_and_clear_bit(SHUFFLE_ENABLE, &shuffle_state))
> > +                     static_branch_disable(&page_alloc_shuffle_key);
> > +     } else if (ctl == SHUFFLE_ENABLE
> > +                     && !test_and_set_bit(SHUFFLE_ENABLE, &shuffle_state))
> > +             static_branch_enable(&page_alloc_shuffle_key);
> > +}
>
> Can this be __meminit?

Yes.
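
I.e., for the next revision:

	-void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
	+void __meminit page_alloc_shuffle(enum mm_shuffle_ctl ctl)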

>
> > +static bool shuffle_param;
> > +static int shuffle_show(char *buffer, const struct kernel_param *kp)
> > +{
> > +     return sprintf(buffer, "%c\n", test_bit(SHUFFLE_ENABLE, &shuffle_state)
> > +                     ? 'Y' : 'N');
> > +}
> > +static int shuffle_store(const char *val, const struct kernel_param *kp)
> > +{
> > +     int rc = param_set_bool(val, kp);
> > +
> > +     if (rc < 0)
> > +             return rc;
> > +     if (shuffle_param)
> > +             page_alloc_shuffle(SHUFFLE_ENABLE);
> > +     else
> > +             page_alloc_shuffle(SHUFFLE_FORCE_DISABLE);
> > +     return 0;
> > +}
> > +module_param_call(shuffle, shuffle_store, shuffle_show, &shuffle_param, 0400);
> >
> > ...
> >
> > +/*
> > + * Fisher-Yates shuffle the freelist which prescribes iterating through
> > + * an array, pfns in this case, and randomly swapping each entry with
> > + * another in the span, end_pfn - start_pfn.
> > + *
> > + * To keep the implementation simple it does not attempt to correct for
> > + * sources of bias in the distribution, like modulo bias or
> > + * pseudo-random number generator bias. I.e. the expectation is that
> > + * this shuffling raises the bar for attacks that exploit the
> > + * predictability of page allocations, but need not be a perfect
> > + * shuffle.
>
> Reflowing the comment to use all 80 cols would save a line :)

Will do.

>
> > + */
> > +#define SHUFFLE_RETRY 10
> > +void __meminit __shuffle_zone(struct zone *z)
> > +{
> > +     unsigned long i, flags;
> > +     unsigned long start_pfn = z->zone_start_pfn;
> > +     unsigned long end_pfn = zone_end_pfn(z);
> > +     const int order = SHUFFLE_ORDER;
> > +     const int order_pages = 1 << order;
> > +
> > +     spin_lock_irqsave(&z->lock, flags);
> > +     start_pfn = ALIGN(start_pfn, order_pages);
> > +     for (i = start_pfn; i < end_pfn; i += order_pages) {
> > +             unsigned long j;
> > +             int migratetype, retry;
> > +             struct page *page_i, *page_j;
> > +
> > +             /*
> > +              * We expect page_i, in the sub-range of a zone being
> > +              * added (@start_pfn to @end_pfn), to more likely be
> > +              * valid compared to page_j randomly selected in the
> > +              * span @zone_start_pfn to @spanned_pages.
> > +              */
> > +             page_i = shuffle_valid_page(i, order);
> > +             if (!page_i)
> > +                     continue;
> > +
> > +             for (retry = 0; retry < SHUFFLE_RETRY; retry++) {
> > +                     /*
> > +                      * Pick a random order aligned page from the
> > +                      * start of the zone. Use the *whole* zone here
> > +                      * so that if it is freed in tiny pieces that we
> > +                      * randomize in the whole zone, not just within
> > +                      * those fragments.
>
> Second sentence is hard to parse.

Earlier versions only arranged to shuffle over non-hole ranges, but
the SHUFFLE_RETRY loop works around that now. I'll update the comment.
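Perhaps something along these lines (a suggested rewording, not the
final text):

	/*
	 * Pick a random order-aligned page from the *whole* zone span.
	 * page_j may land in a hole, so SHUFFLE_RETRY tries a few
	 * candidates; shuffling over the full span (rather than only
	 * the sub-range being added) randomizes across fragments of
	 * previously freed memory, not just within them.
	 */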

>
> > +                      *
> > +                      * Since page_j comes from a potentially sparse
> > +                      * address range we want to try a bit harder to
> > +                      * find a shuffle point for page_i.
> > +                      */
>
> Reflow the comment...

yup.

>
> > +                     j = z->zone_start_pfn +
> > +                             ALIGN_DOWN(get_random_long() % z->spanned_pages,
> > +                                             order_pages);
> > +                     page_j = shuffle_valid_page(j, order);
> > +                     if (page_j && page_j != page_i)
> > +                             break;
> > +             }
> > +             if (retry >= SHUFFLE_RETRY) {
> > +                     pr_debug("%s: failed to swap %#lx\n", __func__, i);
> > +                     continue;
> > +             }
> > +
> > +             /*
> > +              * Each migratetype corresponds to its own list, make
> > +              * sure the types match otherwise we're moving pages to
> > +              * lists where they do not belong.
> > +              */
>
> Reflow.

ok.
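
E.g., reflowed to use the full 80 columns:

	/*
	 * Each migratetype corresponds to its own list; make sure the types
	 * match, otherwise we're moving pages to lists where they do not belong.
	 */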

Patch

diff --git a/include/linux/list.h b/include/linux/list.h
index edb7628e46ed..3dfb8953f241 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -150,6 +150,23 @@  static inline void list_replace_init(struct list_head *old,
 	INIT_LIST_HEAD(old);
 }
 
+/**
+ * list_swap - replace entry1 with entry2 and re-add entry1 at entry2's position
+ * @entry1: the location to place entry2
+ * @entry2: the location to place entry1
+ */
+static inline void list_swap(struct list_head *entry1,
+			     struct list_head *entry2)
+{
+	struct list_head *pos = entry2->prev;
+
+	list_del(entry2);
+	list_replace(entry1, entry2);
+	if (pos == entry1)
+		pos = entry2;
+	list_add(entry1, pos);
+}
+
 /**
  * list_del_init - deletes entry from list and reinitialize it.
  * @entry: the element to delete from the list.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cc4a507d7ca4..374e9d483382 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1272,6 +1272,10 @@  void sparse_init(void);
 #else
 #define sparse_init()	do {} while (0)
 #define sparse_index_init(_sec, _nid)  do {} while (0)
+static inline int pfn_present(unsigned long pfn)
+{
+	return pfn_valid(pfn);
+}
 #endif /* CONFIG_SPARSEMEM */
 
 /*
diff --git a/include/linux/shuffle.h b/include/linux/shuffle.h
new file mode 100644
index 000000000000..bed2d2901d13
--- /dev/null
+++ b/include/linux/shuffle.h
@@ -0,0 +1,45 @@ 
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2018 Intel Corporation. All rights reserved.
+#ifndef _MM_SHUFFLE_H
+#define _MM_SHUFFLE_H
+#include <linux/jump_label.h>
+
+enum mm_shuffle_ctl {
+	SHUFFLE_ENABLE,
+	SHUFFLE_FORCE_DISABLE,
+};
+
+#define SHUFFLE_ORDER (MAX_ORDER-1)
+
+#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
+DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
+extern void page_alloc_shuffle(enum mm_shuffle_ctl ctl);
+extern void __shuffle_free_memory(pg_data_t *pgdat);
+static inline void shuffle_free_memory(pg_data_t *pgdat)
+{
+	if (!static_branch_unlikely(&page_alloc_shuffle_key))
+		return;
+	__shuffle_free_memory(pgdat);
+}
+
+extern void __shuffle_zone(struct zone *z);
+static inline void shuffle_zone(struct zone *z)
+{
+	if (!static_branch_unlikely(&page_alloc_shuffle_key))
+		return;
+	__shuffle_zone(z);
+}
+#else
+static inline void shuffle_free_memory(pg_data_t *pgdat)
+{
+}
+
+static inline void shuffle_zone(struct zone *z)
+{
+}
+
+static inline void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
+{
+}
+#endif
+#endif /* _MM_SHUFFLE_H */
diff --git a/init/Kconfig b/init/Kconfig
index d47cb77a220e..cfa199f3e9be 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1714,6 +1714,29 @@  config SLAB_FREELIST_HARDENED
 	  sacrifies to harden the kernel slab allocator against common
 	  freelist exploit methods.
 
+config SHUFFLE_PAGE_ALLOCATOR
+	bool "Page allocator randomization"
+	default SLAB_FREELIST_RANDOM && ACPI_NUMA
+	help
+	  Randomization of the page allocator improves the average
+	  utilization of a direct-mapped memory-side-cache. See section
+	  5.2.27 Heterogeneous Memory Attribute Table (HMAT) in the ACPI
+	  6.2a specification for an example of how a platform advertises
+	  the presence of a memory-side-cache. There are also incidental
+	  security benefits as it reduces the predictability of page
+	  allocations to complement SLAB_FREELIST_RANDOM, but the
+	  default granularity of shuffling on 4MB (MAX_ORDER-1) pages is
+	  selected based on cache utilization benefits.
+
+	  While the randomization improves cache utilization it may
+	  negatively impact workloads on platforms without a cache. For
+	  this reason, by default, the randomization is enabled only
+	  after runtime detection of a direct-mapped memory-side-cache.
+	  Otherwise, the randomization may be force enabled with the
+	  'page_alloc.shuffle' kernel command line parameter.
+
+	  Say Y if unsure.
+
 config SLUB_CPU_PARTIAL
 	default y
 	depends on SLUB && SMP
diff --git a/mm/Makefile b/mm/Makefile
index d210cc9d6f80..ac5e5ba78874 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,7 +33,7 @@  mmu-$(CONFIG_MMU)	+= process_vm_access.o
 endif
 
 obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
-			   maccess.o page_alloc.o page-writeback.o \
+			   maccess.o page-writeback.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
@@ -41,6 +41,11 @@  obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   interval_tree.o list_lru.o workingset.o \
 			   debug.o $(mmu-y)
 
+# Give 'page_alloc' its own module-parameter namespace
+page-alloc-y := page_alloc.o
+page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
+
+obj-y += page-alloc.o
 obj-y += init-mm.o
 obj-y += memblock.o
 
diff --git a/mm/memblock.c b/mm/memblock.c
index 022d4cbb3618..c0cfbfae4a03 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -17,6 +17,7 @@ 
 #include <linux/poison.h>
 #include <linux/pfn.h>
 #include <linux/debugfs.h>
+#include <linux/shuffle.h>
 #include <linux/kmemleak.h>
 #include <linux/seq_file.h>
 #include <linux/memblock.h>
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b9a667d36c55..07732be3065e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -23,6 +23,7 @@ 
 #include <linux/highmem.h>
 #include <linux/vmalloc.h>
 #include <linux/ioport.h>
+#include <linux/shuffle.h>
 #include <linux/delay.h>
 #include <linux/migrate.h>
 #include <linux/page-isolation.h>
@@ -895,6 +896,8 @@  int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	zone->zone_pgdat->node_present_pages += onlined_pages;
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 
+	shuffle_zone(zone);
+
 	if (onlined_pages) {
 		node_states_set_node(nid, &arg);
 		if (need_zonelists_rebuild)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cde5dac6229a..6208ff744b07 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -61,6 +61,7 @@ 
 #include <linux/sched/rt.h>
 #include <linux/sched/mm.h>
 #include <linux/page_owner.h>
+#include <linux/shuffle.h>
 #include <linux/kthread.h>
 #include <linux/memcontrol.h>
 #include <linux/ftrace.h>
@@ -1752,9 +1753,9 @@  _deferred_grow_zone(struct zone *zone, unsigned int order)
 void __init page_alloc_init_late(void)
 {
 	struct zone *zone;
+	int nid;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-	int nid;
 
 	/* There will be num_node_state(N_MEMORY) threads */
 	atomic_set(&pgdat_init_n_undone, num_node_state(N_MEMORY));
@@ -1779,6 +1780,9 @@  void __init page_alloc_init_late(void)
 	memblock_discard();
 #endif
 
+	for_each_node_state(nid, N_MEMORY)
+		shuffle_free_memory(NODE_DATA(nid));
+
 	for_each_populated_zone(zone)
 		set_zone_contiguous(zone);
 }
diff --git a/mm/shuffle.c b/mm/shuffle.c
new file mode 100644
index 000000000000..db517cdbaebe
--- /dev/null
+++ b/mm/shuffle.c
@@ -0,0 +1,188 @@ 
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2018 Intel Corporation. All rights reserved.
+
+#include <linux/mm.h>
+#include <linux/init.h>
+#include <linux/mmzone.h>
+#include <linux/random.h>
+#include <linux/shuffle.h>
+#include <linux/moduleparam.h>
+#include "internal.h"
+
+DEFINE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
+static unsigned long shuffle_state __ro_after_init;
+
+/*
+ * Depending on the architecture, module parameter parsing may run
+ * before, or after the cache detection. SHUFFLE_FORCE_DISABLE prevents,
+ * or reverts the enabling of the shuffle implementation. SHUFFLE_ENABLE
+ * attempts to turn on the implementation, but aborts if it finds
+ * SHUFFLE_FORCE_DISABLE already set.
+ */
+void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
+{
+	if (ctl == SHUFFLE_FORCE_DISABLE)
+		set_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state);
+
+	if (test_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state)) {
+		if (test_and_clear_bit(SHUFFLE_ENABLE, &shuffle_state))
+			static_branch_disable(&page_alloc_shuffle_key);
+	} else if (ctl == SHUFFLE_ENABLE
+			&& !test_and_set_bit(SHUFFLE_ENABLE, &shuffle_state))
+		static_branch_enable(&page_alloc_shuffle_key);
+}
+
+static bool shuffle_param;
+static int shuffle_show(char *buffer, const struct kernel_param *kp)
+{
+	return sprintf(buffer, "%c\n", test_bit(SHUFFLE_ENABLE, &shuffle_state)
+			? 'Y' : 'N');
+}
+static int shuffle_store(const char *val, const struct kernel_param *kp)
+{
+	int rc = param_set_bool(val, kp);
+
+	if (rc < 0)
+		return rc;
+	if (shuffle_param)
+		page_alloc_shuffle(SHUFFLE_ENABLE);
+	else
+		page_alloc_shuffle(SHUFFLE_FORCE_DISABLE);
+	return 0;
+}
+module_param_call(shuffle, shuffle_store, shuffle_show, &shuffle_param, 0400);
+
+/*
+ * For two pages to be swapped in the shuffle, they must be free (on a
+ * 'free_area' lru), have the same order, and have the same migratetype.
+ */
+static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
+{
+	struct page *page;
+
+	/*
+	 * Given we're dealing with randomly selected pfns in a zone we
+	 * need to ask questions like...
+	 */
+
+	/* ...is the pfn even in the memmap? */
+	if (!pfn_valid_within(pfn))
+		return NULL;
+
+	/* ...is the pfn in a present section or a hole? */
+	if (!pfn_present(pfn))
+		return NULL;
+
+	/* ...is the page free and currently on a free_area list? */
+	page = pfn_to_page(pfn);
+	if (!PageBuddy(page))
+		return NULL;
+
+	/*
+	 * ...is the page on the same list as the page we will
+	 * shuffle it with?
+	 */
+	if (page_order(page) != order)
+		return NULL;
+
+	return page;
+}
+
+/*
+ * Fisher-Yates shuffle the freelist which prescribes iterating through
+ * an array, pfns in this case, and randomly swapping each entry with
+ * another in the span, end_pfn - start_pfn.
+ *
+ * To keep the implementation simple it does not attempt to correct for
+ * sources of bias in the distribution, like modulo bias or
+ * pseudo-random number generator bias. I.e. the expectation is that
+ * this shuffling raises the bar for attacks that exploit the
+ * predictability of page allocations, but need not be a perfect
+ * shuffle.
+ */
+#define SHUFFLE_RETRY 10
+void __meminit __shuffle_zone(struct zone *z)
+{
+	unsigned long i, flags;
+	unsigned long start_pfn = z->zone_start_pfn;
+	unsigned long end_pfn = zone_end_pfn(z);
+	const int order = SHUFFLE_ORDER;
+	const int order_pages = 1 << order;
+
+	spin_lock_irqsave(&z->lock, flags);
+	start_pfn = ALIGN(start_pfn, order_pages);
+	for (i = start_pfn; i < end_pfn; i += order_pages) {
+		unsigned long j;
+		int migratetype, retry;
+		struct page *page_i, *page_j;
+
+		/*
+		 * We expect page_i, in the sub-range of a zone being
+		 * added (@start_pfn to @end_pfn), to more likely be
+		 * valid compared to page_j randomly selected in the
+		 * span @zone_start_pfn to @spanned_pages.
+		 */
+		page_i = shuffle_valid_page(i, order);
+		if (!page_i)
+			continue;
+
+		for (retry = 0; retry < SHUFFLE_RETRY; retry++) {
+			/*
+			 * Pick a random order aligned page from the
+			 * start of the zone. Use the *whole* zone here
+			 * so that if it is freed in tiny pieces that we
+			 * randomize in the whole zone, not just within
+			 * those fragments.
+			 *
+			 * Since page_j comes from a potentially sparse
+			 * address range we want to try a bit harder to
+			 * find a shuffle point for page_i.
+			 */
+			j = z->zone_start_pfn +
+				ALIGN_DOWN(get_random_long() % z->spanned_pages,
+						order_pages);
+			page_j = shuffle_valid_page(j, order);
+			if (page_j && page_j != page_i)
+				break;
+		}
+		if (retry >= SHUFFLE_RETRY) {
+			pr_debug("%s: failed to swap %#lx\n", __func__, i);
+			continue;
+		}
+
+		/*
+		 * Each migratetype corresponds to its own list, make
+		 * sure the types match otherwise we're moving pages to
+		 * lists where they do not belong.
+		 */
+		migratetype = get_pageblock_migratetype(page_i);
+		if (get_pageblock_migratetype(page_j) != migratetype) {
+			pr_debug("%s: migratetype mismatch %#lx\n", __func__, i);
+			continue;
+		}
+
+		list_swap(&page_i->lru, &page_j->lru);
+
+		pr_debug("%s: swap: %#lx -> %#lx\n", __func__, i, j);
+
+		/* take it easy on the zone lock */
+		if ((i % (100 * order_pages)) == 0) {
+			spin_unlock_irqrestore(&z->lock, flags);
+			cond_resched();
+			spin_lock_irqsave(&z->lock, flags);
+		}
+	}
+	spin_unlock_irqrestore(&z->lock, flags);
+}
+
+/**
+ * shuffle_free_memory - reduce the predictability of the page allocator
+ * @pgdat: node page data
+ */
+void __meminit __shuffle_free_memory(pg_data_t *pgdat)
+{
+	struct zone *z;
+
+	for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
+		shuffle_zone(z);
+}