diff mbox series

[v9,1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization

Message ID 154882453604.1338686.15108059741397800728.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State New, archived
Headers show
Series mm: Randomize free memory | expand

Commit Message

Dan Williams Jan. 30, 2019, 5:02 a.m. UTC
Randomization of the page allocator improves the average utilization of
a direct-mapped memory-side-cache. Memory side caching is a platform
capability that Linux has previously been exposed to in HPC
(high-performance computing) environments on specialty platforms. In
that instance it was a smaller pool of high-bandwidth memory relative to
higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
be found on general purpose server platforms where DRAM is a cache in
front of higher latency persistent memory [1].

Robert offered an explanation of the state of the art of Linux
interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    also reduces the chance that the buffers will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel).  That's better than forcing
    users to deploy remedies like:
        "To eliminate this gradual degradation, we have added a Stream
         measurement to the Node Health Check that follows each job;
         nodes are rebooted whenever their measured memory bandwidth
         falls below 300 GB/s."

A replacement for zonesort was merged upstream in commit cc9aec03e58f
"x86/numa_emulation: Introduce uniform split capability". With this
numa_emulation capability, memory can be split into cache-sized
("near-memory" sized) numa nodes. Binding a workload to such a node,
and disabling workloads on other nodes, enables full cache performance.
However, once the workload exceeds the cache size, cache conflicts are
unavoidable. While HPC environments might be able to tolerate
time-scheduling of cache-sized workloads, for general purpose server
platforms the oversubscribed cache case will be the common case.
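
As an illustration of the workflow described above (the uniform-split
syntax is from commit cc9aec03e58f; the node count, node id, and
workload name are placeholders):

    # boot with memory split into cache-sized emulated nodes
    numa=fake=8U

    # then pin a workload's cpus and memory to one near-memory sized node
    numactl --cpunodebind=1 --membind=1 ./workload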

The worst case scenario is that a server system owner benchmarks a
workload at boot with an un-contended cache, only to see performance
degrade over time, even below the average cache performance, due to
excessive conflicts. Randomization clips the peaks and fills in the
valleys of cache utilization to yield steady average performance.

Here are some performance impact details of the patches:

1/ An Intel internal synthetic memory bandwidth measurement tool saw a
3X speedup in a contrived case that tries to force cache conflicts. The
contrived case used the numa_emulation capability to force an instance
of the benchmark to be run in two of the near-memory sized numa nodes.
If both instances were placed on the same emulated node they would fit
and cause zero conflicts. On separate emulated nodes without
randomization they underutilized the cache and conflicted unnecessarily
due to the in-order allocation per node.

2/ A well known Java server application benchmark was run with a heap
size that exceeded cache size by 3X. The cache conflict rate was 8% for
the first run and degraded to 21% after page allocator aging. With
randomization enabled the rate levelled out at 11%.

3/ A MongoDB workload did not observe a measurable difference in
cache-conflict rates, but the overall throughput dropped by 7% with
randomization in one case.

4/ Mel Gorman ran his suite of performance workloads with randomization
enabled on platforms without a memory-side-cache and saw a mix of some
improvements and some losses [3].

While there is potentially significant improvement for applications that
depend on low latency access across a wide working-set, the impact may
be negligible or negative for other workloads. For this reason the
shuffle capability defaults to off unless a direct-mapped
memory-side-cache is detected. Even then, the page_alloc.shuffle=0
parameter can be specified to disable the randomization on those
systems.
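
Concretely, the knob looks like this (the sysfs path follows the usual
/sys/module/<name>/parameters convention for the module_param_call() in
mm/shuffle.c below; the 0400 permission makes it read-only at runtime):

    page_alloc.shuffle=1     # force-enable, even without a detected cache
    page_alloc.shuffle=0     # force-disable, even with a detected cache

    # read back the effective state
    cat /sys/module/page_alloc/parameters/shuffle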

Outside of memory-side-cache utilization concerns there is a potential
security benefit from randomization. Some data exfiltration and
return-oriented-programming attacks rely on the ability to infer the
location of sensitive data objects. The kernel page allocator,
especially early in system boot, has predictable first-in-first-out
behavior for physical pages. Pages are freed in physical address order
when first onlined.

Quoting Kees:
    "While we already have a base-address randomization
     (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
     memory layouts would certainly be using the predictability of
     allocation ordering (i.e. for attacks where the base address isn't
     important: only the relative positions between allocated memory).
     This is common in lots of heap-style attacks. They try to gain
     control over ordering by spraying allocations, etc.

     I'd really like to see this because it gives us something similar
     to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

While SLAB_FREELIST_RANDOM reduces the predictability of some local
slab caches, it leaves the vast bulk of memory to be allocated in a
predictable order. However, it should be noted that the concrete
security benefits are hard to quantify, and no known CVE is mitigated
by this randomization.

Introduce shuffle_free_memory(), and its helper shuffle_zone(), to
perform a Fisher-Yates shuffle of the page allocator 'free_area' lists
when they are initially populated with free memory at boot and at
hotplug time. Do this based on either the presence of a
page_alloc.shuffle=Y command line parameter, or autodetection of a
memory-side-cache (to be added in a follow-on patch).

The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
pages, where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1,
i.e. order 10 (4MB with 4KB pages). This trades off randomization
granularity for time spent shuffling. MAX_ORDER-1 was chosen to be
minimally invasive to the page allocator while still showing
memory-side-cache behavior improvements, with the expectation that the
security implications of finer granularity randomization are mitigated
by CONFIG_SLAB_FREELIST_RANDOM.
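
For reference, the resulting shuffle granularity works out as:

    shuffle unit = PAGE_SIZE << SHUFFLE_ORDER = 4KB << 10 = 4MB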

The performance impact of the shuffling appears to be in the noise
compared to other memory initialization work. Also the bulk of the work
is done in the background as a part of deferred_init_memmap().

This initial randomization can be undone over time, so a follow-on
patch is introduced to inject entropy on page free decisions. It is
reasonable to ask whether the page free entropy alone would be
sufficient; it is not, due to the in-order initial freeing of pages. At
the start of that process, putting page1 in front of or behind page0
still keeps them close together; page2 is still near page1 and has a
high chance of being adjacent. As more pages are added, ordering
diversity improves, but there is still high page locality for the low
address pages, and this leads to no significant impact on the cache
conflict rate.
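
That argument can be checked with a small userspace sketch (not kernel
code): "free" pages 0..N-1 in order, flip a coin to place each at the
head or tail of the list, and count how many list-neighbors remain
physically adjacent. Roughly half do, versus ~2/N for a full shuffle:

    #include <stdio.h>
    #include <stdlib.h>

    #define N 1024

    int main(void)
    {
            static int list[2 * N];
            int head = N, tail = N; /* occupied range: list[head..tail) */
            int adjacent = 0;
            int i;

            srand(1);
            /* in-order freeing with a random front/back decision per page */
            for (i = 0; i < N; i++) {
                    if (rand() & 1)
                            list[--head] = i;
                    else
                            list[tail++] = i;
            }
            /* count list-neighbors that are still physically adjacent */
            for (i = head; i < tail - 1; i++)
                    if (abs(list[i] - list[i + 1]) == 1)
                            adjacent++;
            printf("%d of %d neighbor pairs still adjacent\n", adjacent, N - 1);
            return 0;
    }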

[1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
[2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
[3]: https://lkml.org/lkml/2018/10/12/309

Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/list.h    |   17 ++++
 include/linux/mmzone.h  |    4 +
 include/linux/shuffle.h |   45 +++++++++++
 init/Kconfig            |   23 ++++++
 mm/Makefile             |    7 ++
 mm/memblock.c           |    1 
 mm/memory_hotplug.c     |    3 +
 mm/page_alloc.c         |    6 +-
 mm/shuffle.c            |  188 +++++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 292 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/shuffle.h
 create mode 100644 mm/shuffle.c

Comments

Mike Rapoport Jan. 30, 2019, 6:48 a.m. UTC | #1
On Tue, Jan 29, 2019 at 09:02:16PM -0800, Dan Williams wrote:
> Randomization of the page allocator improves the average utilization of
> a direct-mapped memory-side-cache. Memory side caching is a platform
> capability that Linux has previously been exposed to in HPC
> (high-performance computing) environments on specialty platforms. In
> that instance it was a smaller pool of high-bandwidth memory relative to
> higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
> be found on general purpose server platforms where DRAM is a cache in
> front of higher latency persistent memory [1].

[ ... ]
 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  include/linux/list.h    |   17 ++++
>  include/linux/mmzone.h  |    4 +
>  include/linux/shuffle.h |   45 +++++++++++
>  init/Kconfig            |   23 ++++++
>  mm/Makefile             |    7 ++
>  mm/memblock.c           |    1 
>  mm/memory_hotplug.c     |    3 +
>  mm/page_alloc.c         |    6 +-
>  mm/shuffle.c            |  188 +++++++++++++++++++++++++++++++++++++++++++++++
>  9 files changed, 292 insertions(+), 2 deletions(-)
>  create mode 100644 include/linux/shuffle.h
>  create mode 100644 mm/shuffle.c

...

> diff --git a/mm/memblock.c b/mm/memblock.c
> index 022d4cbb3618..c0cfbfae4a03 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -17,6 +17,7 @@
>  #include <linux/poison.h>
>  #include <linux/pfn.h>
>  #include <linux/debugfs.h>
> +#include <linux/shuffle.h>

Nit: does not seem to be required

>  #include <linux/kmemleak.h>
>  #include <linux/seq_file.h>
>  #include <linux/memblock.h>
Dan Williams Jan. 30, 2019, 6:30 p.m. UTC | #2
On Tue, Jan 29, 2019 at 10:49 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Tue, Jan 29, 2019 at 09:02:16PM -0800, Dan Williams wrote:
> > [ ... ]
>
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 022d4cbb3618..c0cfbfae4a03 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -17,6 +17,7 @@
> >  #include <linux/poison.h>
> >  #include <linux/pfn.h>
> >  #include <linux/debugfs.h>
> > +#include <linux/shuffle.h>
>
> Nit: does not seem to be required
>
> >  #include <linux/kmemleak.h>
> >  #include <linux/seq_file.h>
> >  #include <linux/memblock.h>

Good catch. Notch one more line saved in the incremental diffstat from
v8. I'll wait for Michal's thumbs up on the rest before re-spinning,
or perhaps Andrew can drop this line on applying?
Michal Hocko Jan. 30, 2019, 7:08 p.m. UTC | #3
On Tue 29-01-19 21:02:16, Dan Williams wrote:
> [ ... ]
> 
> The performance impact of the shuffling appears to be in the noise
> compared to other memory initialization work. Also the bulk of the work
> is done in the background as a part of deferred_init_memmap().

The last part is not true with this version anymore, right?

> This initial randomization can be undone over time, so a follow-on
> patch is introduced to inject entropy on page free decisions. It is
> reasonable to ask whether the page free entropy alone would be
> sufficient; it is not, due to the in-order initial freeing of pages. At
> the start of that process, putting page1 in front of or behind page0
> still keeps them close together; page2 is still near page1 and has a
> high chance of being adjacent. As more pages are added, ordering
> diversity improves, but there is still high page locality for the low
> address pages, and this leads to no significant impact on the cache
> conflict rate.

I find mm_shuffle_ctl a bit confusing because the mode of operation is
either AUTO (enabled when the HW is present) or FORCE_ENABLE when
explicitly enabled by the command line. Nothing earth shattering though.

> [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
> [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
> [3]: https://lkml.org/lkml/2018/10/12/309
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Other than that, I haven't spotted any fundamental issues. The feature
is a hack but I do agree that it might be useful for the specific HW it
is going to be used for. I still think that shuffling only top orders
has close to zero security benefits because it is not that hard to
control the memory fragmentation.

With that
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/list.h    |   17 ++++
>  include/linux/mmzone.h  |    4 +
>  include/linux/shuffle.h |   45 +++++++++++
>  init/Kconfig            |   23 ++++++
>  mm/Makefile             |    7 ++
>  mm/memblock.c           |    1 
>  mm/memory_hotplug.c     |    3 +
>  mm/page_alloc.c         |    6 +-
>  mm/shuffle.c            |  188 +++++++++++++++++++++++++++++++++++++++++++++++
>  9 files changed, 292 insertions(+), 2 deletions(-)
>  create mode 100644 include/linux/shuffle.h
>  create mode 100644 mm/shuffle.c
> 
> diff --git a/include/linux/list.h b/include/linux/list.h
> index edb7628e46ed..3dfb8953f241 100644
> --- a/include/linux/list.h
> +++ b/include/linux/list.h
> @@ -150,6 +150,23 @@ static inline void list_replace_init(struct list_head *old,
>  	INIT_LIST_HEAD(old);
>  }
>  
> +/**
> + * list_swap - replace entry1 with entry2 and re-add entry1 at entry2's position
> + * @entry1: the location to place entry2
> + * @entry2: the location to place entry1
> + */
> +static inline void list_swap(struct list_head *entry1,
> +			     struct list_head *entry2)
> +{
> +	struct list_head *pos = entry2->prev;
> +
> +	list_del(entry2);
> +	list_replace(entry1, entry2);
> +	if (pos == entry1)
> +		pos = entry2;
> +	list_add(entry1, pos);
> +}
> +
>  /**
>   * list_del_init - deletes entry from list and reinitialize it.
>   * @entry: the element to delete from the list.
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index cc4a507d7ca4..374e9d483382 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1272,6 +1272,10 @@ void sparse_init(void);
>  #else
>  #define sparse_init()	do {} while (0)
>  #define sparse_index_init(_sec, _nid)  do {} while (0)
> +static inline int pfn_present(unsigned long pfn)
> +{
> +	return pfn_valid(pfn);
> +}
>  #endif /* CONFIG_SPARSEMEM */
>  
>  /*
> diff --git a/include/linux/shuffle.h b/include/linux/shuffle.h
> new file mode 100644
> index 000000000000..bed2d2901d13
> --- /dev/null
> +++ b/include/linux/shuffle.h
> @@ -0,0 +1,45 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2018 Intel Corporation. All rights reserved.
> +#ifndef _MM_SHUFFLE_H
> +#define _MM_SHUFFLE_H
> +#include <linux/jump_label.h>
> +
> +enum mm_shuffle_ctl {
> +	SHUFFLE_ENABLE,
> +	SHUFFLE_FORCE_DISABLE,
> +};
> +
> +#define SHUFFLE_ORDER (MAX_ORDER-1)
> +
> +#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
> +DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
> +extern void page_alloc_shuffle(enum mm_shuffle_ctl ctl);
> +extern void __shuffle_free_memory(pg_data_t *pgdat);
> +static inline void shuffle_free_memory(pg_data_t *pgdat)
> +{
> +	if (!static_branch_unlikely(&page_alloc_shuffle_key))
> +		return;
> +	__shuffle_free_memory(pgdat);
> +}
> +
> +extern void __shuffle_zone(struct zone *z);
> +static inline void shuffle_zone(struct zone *z)
> +{
> +	if (!static_branch_unlikely(&page_alloc_shuffle_key))
> +		return;
> +	__shuffle_zone(z);
> +}
> +#else
> +static inline void shuffle_free_memory(pg_data_t *pgdat)
> +{
> +}
> +
> +static inline void shuffle_zone(struct zone *z)
> +{
> +}
> +
> +static inline void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
> +{
> +}
> +#endif
> +#endif /* _MM_SHUFFLE_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index d47cb77a220e..cfa199f3e9be 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1714,6 +1714,29 @@ config SLAB_FREELIST_HARDENED
>  	  sacrifies to harden the kernel slab allocator against common
>  	  freelist exploit methods.
>  
> +config SHUFFLE_PAGE_ALLOCATOR
> +	bool "Page allocator randomization"
> +	default SLAB_FREELIST_RANDOM && ACPI_NUMA
> +	help
> +	  Randomization of the page allocator improves the average
> +	  utilization of a direct-mapped memory-side-cache. See section
> +	  5.2.27 Heterogeneous Memory Attribute Table (HMAT) in the ACPI
> +	  6.2a specification for an example of how a platform advertises
> +	  the presence of a memory-side-cache. There are also incidental
> +	  security benefits as it reduces the predictability of page
> +	  allocations to compliment SLAB_FREELIST_RANDOM, but the
> +	  default granularity of shuffling on 4MB (MAX_ORDER) pages is
> +	  selected based on cache utilization benefits.
> +
> +	  While the randomization improves cache utilization it may
> +	  negatively impact workloads on platforms without a cache. For
> +	  this reason, by default, the randomization is enabled only
> +	  after runtime detection of a direct-mapped memory-side-cache.
> +	  Otherwise, the randomization may be force enabled with the
> +	  'page_alloc.shuffle' kernel command line parameter.
> +
> +	  Say Y if unsure.
> +
>  config SLUB_CPU_PARTIAL
>  	default y
>  	depends on SLUB && SMP
> diff --git a/mm/Makefile b/mm/Makefile
> index d210cc9d6f80..ac5e5ba78874 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -33,7 +33,7 @@ mmu-$(CONFIG_MMU)	+= process_vm_access.o
>  endif
>  
>  obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
> -			   maccess.o page_alloc.o page-writeback.o \
> +			   maccess.o page-writeback.o \
>  			   readahead.o swap.o truncate.o vmscan.o shmem.o \
>  			   util.o mmzone.o vmstat.o backing-dev.o \
>  			   mm_init.o mmu_context.o percpu.o slab_common.o \
> @@ -41,6 +41,11 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
>  			   interval_tree.o list_lru.o workingset.o \
>  			   debug.o $(mmu-y)
>  
> +# Give 'page_alloc' its own module-parameter namespace
> +page-alloc-y := page_alloc.o
> +page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
> +
> +obj-y += page-alloc.o
>  obj-y += init-mm.o
>  obj-y += memblock.o
>  
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 022d4cbb3618..c0cfbfae4a03 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -17,6 +17,7 @@
>  #include <linux/poison.h>
>  #include <linux/pfn.h>
>  #include <linux/debugfs.h>
> +#include <linux/shuffle.h>
>  #include <linux/kmemleak.h>
>  #include <linux/seq_file.h>
>  #include <linux/memblock.h>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index b9a667d36c55..07732be3065e 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -23,6 +23,7 @@
>  #include <linux/highmem.h>
>  #include <linux/vmalloc.h>
>  #include <linux/ioport.h>
> +#include <linux/shuffle.h>
>  #include <linux/delay.h>
>  #include <linux/migrate.h>
>  #include <linux/page-isolation.h>
> @@ -895,6 +896,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
>  	zone->zone_pgdat->node_present_pages += onlined_pages;
>  	pgdat_resize_unlock(zone->zone_pgdat, &flags);
>  
> +	shuffle_zone(zone);
> +
>  	if (onlined_pages) {
>  		node_states_set_node(nid, &arg);
>  		if (need_zonelists_rebuild)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index cde5dac6229a..6208ff744b07 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -61,6 +61,7 @@
>  #include <linux/sched/rt.h>
>  #include <linux/sched/mm.h>
>  #include <linux/page_owner.h>
> +#include <linux/shuffle.h>
>  #include <linux/kthread.h>
>  #include <linux/memcontrol.h>
>  #include <linux/ftrace.h>
> @@ -1752,9 +1753,9 @@ _deferred_grow_zone(struct zone *zone, unsigned int order)
>  void __init page_alloc_init_late(void)
>  {
>  	struct zone *zone;
> +	int nid;
>  
>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> -	int nid;
>  
>  	/* There will be num_node_state(N_MEMORY) threads */
>  	atomic_set(&pgdat_init_n_undone, num_node_state(N_MEMORY));
> @@ -1779,6 +1780,9 @@ void __init page_alloc_init_late(void)
>  	memblock_discard();
>  #endif
>  
> +	for_each_node_state(nid, N_MEMORY)
> +		shuffle_free_memory(NODE_DATA(nid));
> +
>  	for_each_populated_zone(zone)
>  		set_zone_contiguous(zone);
>  }
> diff --git a/mm/shuffle.c b/mm/shuffle.c
> new file mode 100644
> index 000000000000..db517cdbaebe
> --- /dev/null
> +++ b/mm/shuffle.c
> @@ -0,0 +1,188 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2018 Intel Corporation. All rights reserved.
> +
> +#include <linux/mm.h>
> +#include <linux/init.h>
> +#include <linux/mmzone.h>
> +#include <linux/random.h>
> +#include <linux/shuffle.h>
> +#include <linux/moduleparam.h>
> +#include "internal.h"
> +
> +DEFINE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
> +static unsigned long shuffle_state __ro_after_init;
> +
> +/*
> + * Depending on the architecture, module parameter parsing may run
> + * before, or after the cache detection. SHUFFLE_FORCE_DISABLE prevents,
> + * or reverts the enabling of the shuffle implementation. SHUFFLE_ENABLE
> + * attempts to turn on the implementation, but aborts if it finds
> + * SHUFFLE_FORCE_DISABLE already set.
> + */
> +void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
> +{
> +	if (ctl == SHUFFLE_FORCE_DISABLE)
> +		set_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state);
> +
> +	if (test_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state)) {
> +		if (test_and_clear_bit(SHUFFLE_ENABLE, &shuffle_state))
> +			static_branch_disable(&page_alloc_shuffle_key);
> +	} else if (ctl == SHUFFLE_ENABLE
> +			&& !test_and_set_bit(SHUFFLE_ENABLE, &shuffle_state))
> +		static_branch_enable(&page_alloc_shuffle_key);
> +}
> +
> +static bool shuffle_param;
> +extern int shuffle_show(char *buffer, const struct kernel_param *kp)
> +{
> +	return sprintf(buffer, "%c\n", test_bit(SHUFFLE_ENABLE, &shuffle_state)
> +			? 'Y' : 'N');
> +}
> +static int shuffle_store(const char *val, const struct kernel_param *kp)
> +{
> +	int rc = param_set_bool(val, kp);
> +
> +	if (rc < 0)
> +		return rc;
> +	if (shuffle_param)
> +		page_alloc_shuffle(SHUFFLE_ENABLE);
> +	else
> +		page_alloc_shuffle(SHUFFLE_FORCE_DISABLE);
> +	return 0;
> +}
> +module_param_call(shuffle, shuffle_store, shuffle_show, &shuffle_param, 0400);
> +
> +/*
> + * For two pages to be swapped in the shuffle, they must be free (on a
> + * 'free_area' lru), have the same order, and have the same migratetype.
> + */
> +static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
> +{
> +	struct page *page;
> +
> +	/*
> +	 * Given we're dealing with randomly selected pfns in a zone we
> +	 * need to ask questions like...
> +	 */
> +
> +	/* ...is the pfn even in the memmap? */
> +	if (!pfn_valid_within(pfn))
> +		return NULL;
> +
> +	/* ...is the pfn in a present section or a hole? */
> +	if (!pfn_present(pfn))
> +		return NULL;
> +
> +	/* ...is the page free and currently on a free_area list? */
> +	page = pfn_to_page(pfn);
> +	if (!PageBuddy(page))
> +		return NULL;
> +
> +	/*
> +	 * ...is the page on the same list as the page we will
> +	 * shuffle it with?
> +	 */
> +	if (page_order(page) != order)
> +		return NULL;
> +
> +	return page;
> +}
> +
> +/*
> + * Fisher-Yates shuffle the freelist which prescribes iterating through
> + * an array, pfns in this case, and randomly swapping each entry with
> + * another in the span, end_pfn - start_pfn.
> + *
> + * To keep the implementation simple it does not attempt to correct for
> + * sources of bias in the distribution, like modulo bias or
> + * pseudo-random number generator bias. I.e. the expectation is that
> + * this shuffling raises the bar for attacks that exploit the
> + * predictability of page allocations, but need not be a perfect
> + * shuffle.
> + */
> +#define SHUFFLE_RETRY 10
> +void __meminit __shuffle_zone(struct zone *z)
> +{
> +	unsigned long i, flags;
> +	unsigned long start_pfn = z->zone_start_pfn;
> +	unsigned long end_pfn = zone_end_pfn(z);
> +	const int order = SHUFFLE_ORDER;
> +	const int order_pages = 1 << order;
> +
> +	spin_lock_irqsave(&z->lock, flags);
> +	start_pfn = ALIGN(start_pfn, order_pages);
> +	for (i = start_pfn; i < end_pfn; i += order_pages) {
> +		unsigned long j;
> +		int migratetype, retry;
> +		struct page *page_i, *page_j;
> +
> +		/*
> +		 * We expect page_i, in the sub-range of a zone being
> +		 * added (@start_pfn to @end_pfn), to more likely be
> +		 * valid compared to page_j randomly selected in the
> +		 * span @zone_start_pfn to @spanned_pages.
> +		 */
> +		page_i = shuffle_valid_page(i, order);
> +		if (!page_i)
> +			continue;
> +
> +		for (retry = 0; retry < SHUFFLE_RETRY; retry++) {
> +			/*
> +			 * Pick a random order aligned page from the
> +			 * start of the zone. Use the *whole* zone here
> +			 * so that if it is freed in tiny pieces that we
> +			 * randomize in the whole zone, not just within
> +			 * those fragments.
> +			 *
> +			 * Since page_j comes from a potentially sparse
> +			 * address range we want to try a bit harder to
> +			 * find a shuffle point for page_i.
> +			 */
> +			j = z->zone_start_pfn +
> +				ALIGN_DOWN(get_random_long() % z->spanned_pages,
> +						order_pages);
> +			page_j = shuffle_valid_page(j, order);
> +			if (page_j && page_j != page_i)
> +				break;
> +		}
> +		if (retry >= SHUFFLE_RETRY) {
> +			pr_debug("%s: failed to swap %#lx\n", __func__, i);
> +			continue;
> +		}
> +
> +		/*
> +		 * Each migratetype corresponds to its own list, make
> +		 * sure the types match otherwise we're moving pages to
> +		 * lists where they do not belong.
> +		 */
> +		migratetype = get_pageblock_migratetype(page_i);
> +		if (get_pageblock_migratetype(page_j) != migratetype) {
> +			pr_debug("%s: migratetype mismatch %#lx\n", __func__, i);
> +			continue;
> +		}
> +
> +		list_swap(&page_i->lru, &page_j->lru);
> +
> +		pr_debug("%s: swap: %#lx -> %#lx\n", __func__, i, j);
> +
> +		/* take it easy on the zone lock */
> +		if ((i % (100 * order_pages)) == 0) {
> +			spin_unlock_irqrestore(&z->lock, flags);
> +			cond_resched();
> +			spin_lock_irqsave(&z->lock, flags);
> +		}
> +	}
> +	spin_unlock_irqrestore(&z->lock, flags);
> +}
> +
> +/**
> + * shuffle_free_memory - reduce the predictability of the page allocator
> + * @pgdat: node page data
> + */
> +void __meminit __shuffle_free_memory(pg_data_t *pgdat)
> +{
> +	struct zone *z;
> +
> +	for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
> +		shuffle_zone(z);
> +}
>
Dan Williams Jan. 31, 2019, 1:33 a.m. UTC | #4
On Wed, Jan 30, 2019 at 11:08 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 29-01-19 21:02:16, Dan Williams wrote:
> > [ ... ]
> >
> > The performance impact of the shuffling appears to be in the noise
> > compared to other memory initialization work. Also the bulk of the work
> > is done in the background as a part of deferred_init_memmap().
>
> The last part is not true with this version anymore, right?

True, and given that page_alloc_init_late() is waiting for it to
complete, the impact is no different from v8 to v9. I'll drop that
sentence from the changelog.

>
> > This initial randomization can be undone over time, so a follow-on
> > patch is introduced to inject entropy on page free decisions. It is
> > reasonable to ask whether the page free entropy alone would be
> > sufficient; it is not, due to the in-order initial freeing of pages. At
> > the start of that process, putting page1 in front of or behind page0
> > still keeps them close together; page2 is still near page1 and has a
> > high chance of being adjacent. As more pages are added, ordering
> > diversity improves, but there is still high page locality for the low
> > address pages, and this leads to no significant impact on the cache
> > conflict rate.
>
> I find mm_shuffle_ctl a bit confusing because the mode of operation is
> either AUTO (enabled when the HW is present) or FORCE_ENABLE when
> explicitly enabled by the command line. Nothing earth shattering though.

Yeah, it's named from the perspective of the kernel internal usage
which is flipped from the user facing interaction. ENABLE is called
from the command line handler and in a follow-on patch the parser of
the platform-firmware table indicating the presence of a cache.
FORCE_DISABLE is only called from the command line handler. I'll add a
comment to this effect.
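
For reference, the behavior described above maps out as follows (derived
from page_alloc_shuffle() in the patch, with the cache-detection caller
arriving in the follow-on patch):

    caller                       ctl                    effect
    page_alloc.shuffle=1         SHUFFLE_ENABLE         on, unless force-disabled
    memory-side-cache detection  SHUFFLE_ENABLE         on, unless force-disabled
    page_alloc.shuffle=0         SHUFFLE_FORCE_DISABLE  off, and latched off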

>
> > [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
> > [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
> > [3]: https://lkml.org/lkml/2018/10/12/309
> >
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > Cc: Mike Rapoport <rppt@linux.ibm.com>
> > Reviewed-by: Kees Cook <keescook@chromium.org>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>
> Other than that, I haven't spotted any fundamental issues. The feature
> is a hack but I do agree that it might be useful for the specific HW it
> is going to be used for. I still think that shuffling only top orders
> has close to zero security benefits because it is not that hard to
> control the memory fragmentation.
>
> With that
> Acked-by: Michal Hocko <mhocko@suse.com>

Much appreciated.
Andrew Morton Jan. 31, 2019, 10:14 p.m. UTC | #5
On Tue, 29 Jan 2019 21:02:16 -0800 Dan Williams <dan.j.williams@intel.com> wrote:

> [ ... ]
> 
> Introduce shuffle_free_memory(), and its helper shuffle_zone(), to
> perform a Fisher-Yates shuffle of the page allocator 'free_area' lists
> when they are initially populated with free memory at boot and at
> hotplug time. Do this based on either the presence of a
> page_alloc.shuffle=Y command line parameter, or autodetection of a
> memory-side-cache (to be added in a follow-on patch).

This is unfortunate from a testing and coverage point of view.  At
least initially it is desirable that all testers run this feature.

Also, it's unfortunate that enabling the feature requires a reboot.
What happens if we do away with the boot-time (and maybe hotplug-time)
randomization and permit the feature to be switched on/off at runtime?

> [ ... ]
>
>  include/linux/list.h    |   17 ++++
>  include/linux/mmzone.h  |    4 +
>  include/linux/shuffle.h |   45 +++++++++++
>  init/Kconfig            |   23 ++++++
>  mm/Makefile             |    7 ++
>  mm/memblock.c           |    1 
>  mm/memory_hotplug.c     |    3 +
>  mm/page_alloc.c         |    6 +-
>  mm/shuffle.c            |  188 +++++++++++++++++++++++++++++++++++++++++++++++

Can we get a Documentation update for the new kernel parameter?

> 
> ...
>
> --- /dev/null
> +++ b/mm/shuffle.c
> @@ -0,0 +1,188 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2018 Intel Corporation. All rights reserved.
> +
> +#include <linux/mm.h>
> +#include <linux/init.h>
> +#include <linux/mmzone.h>
> +#include <linux/random.h>
> +#include <linux/shuffle.h>

Does shuffle.h need to be available to the whole kernel or can we put
it in mm/?

> +#include <linux/moduleparam.h>
> +#include "internal.h"
> +
> +DEFINE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
> +static unsigned long shuffle_state __ro_after_init;
> +
> +/*
> + * Depending on the architecture, module parameter parsing may run
> + * before, or after the cache detection. SHUFFLE_FORCE_DISABLE prevents,
> + * or reverts the enabling of the shuffle implementation. SHUFFLE_ENABLE
> + * attempts to turn on the implementation, but aborts if it finds
> + * SHUFFLE_FORCE_DISABLE already set.
> + */
> +void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
> +{
> +	if (ctl == SHUFFLE_FORCE_DISABLE)
> +		set_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state);
> +
> +	if (test_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state)) {
> +		if (test_and_clear_bit(SHUFFLE_ENABLE, &shuffle_state))
> +			static_branch_disable(&page_alloc_shuffle_key);
> +	} else if (ctl == SHUFFLE_ENABLE
> +			&& !test_and_set_bit(SHUFFLE_ENABLE, &shuffle_state))
> +		static_branch_enable(&page_alloc_shuffle_key);
> +}

Can this be __meminit?

> +static bool shuffle_param;
> +extern int shuffle_show(char *buffer, const struct kernel_param *kp)
> +{
> +	return sprintf(buffer, "%c\n", test_bit(SHUFFLE_ENABLE, &shuffle_state)
> +			? 'Y' : 'N');
> +}
> +static int shuffle_store(const char *val, const struct kernel_param *kp)
> +{
> +	int rc = param_set_bool(val, kp);
> +
> +	if (rc < 0)
> +		return rc;
> +	if (shuffle_param)
> +		page_alloc_shuffle(SHUFFLE_ENABLE);
> +	else
> +		page_alloc_shuffle(SHUFFLE_FORCE_DISABLE);
> +	return 0;
> +}
> +module_param_call(shuffle, shuffle_store, shuffle_show, &shuffle_param, 0400);
> 
> ...
>
> +/*
> + * Fisher-Yates shuffle the freelist which prescribes iterating through
> + * an array, pfns in this case, and randomly swapping each entry with
> + * another in the span, end_pfn - start_pfn.
> + *
> + * To keep the implementation simple it does not attempt to correct for
> + * sources of bias in the distribution, like modulo bias or
> + * pseudo-random number generator bias. I.e. the expectation is that
> + * this shuffling raises the bar for attacks that exploit the
> + * predictability of page allocations, but need not be a perfect
> + * shuffle.

Reflowing the comment to use all 80 cols would save a line :)

> + */
> +#define SHUFFLE_RETRY 10
> +void __meminit __shuffle_zone(struct zone *z)
> +{
> +	unsigned long i, flags;
> +	unsigned long start_pfn = z->zone_start_pfn;
> +	unsigned long end_pfn = zone_end_pfn(z);
> +	const int order = SHUFFLE_ORDER;
> +	const int order_pages = 1 << order;
> +
> +	spin_lock_irqsave(&z->lock, flags);
> +	start_pfn = ALIGN(start_pfn, order_pages);
> +	for (i = start_pfn; i < end_pfn; i += order_pages) {
> +		unsigned long j;
> +		int migratetype, retry;
> +		struct page *page_i, *page_j;
> +
> +		/*
> +		 * We expect page_i, in the sub-range of a zone being
> +		 * added (@start_pfn to @end_pfn), to more likely be
> +		 * valid compared to page_j randomly selected in the
> +		 * span @zone_start_pfn to @spanned_pages.
> +		 */
> +		page_i = shuffle_valid_page(i, order);
> +		if (!page_i)
> +			continue;
> +
> +		for (retry = 0; retry < SHUFFLE_RETRY; retry++) {
> +			/*
> +			 * Pick a random order aligned page from the
> +			 * start of the zone. Use the *whole* zone here
> +			 * so that if it is freed in tiny pieces that we
> +			 * randomize in the whole zone, not just within
> +			 * those fragments.

Second sentence is hard to parse.

> +			 *
> +			 * Since page_j comes from a potentially sparse
> +			 * address range we want to try a bit harder to
> +			 * find a shuffle point for page_i.
> +			 */

Reflow the comment...

> +			j = z->zone_start_pfn +
> +				ALIGN_DOWN(get_random_long() % z->spanned_pages,
> +						order_pages);
> +			page_j = shuffle_valid_page(j, order);
> +			if (page_j && page_j != page_i)
> +				break;
> +		}
> +		if (retry >= SHUFFLE_RETRY) {
> +			pr_debug("%s: failed to swap %#lx\n", __func__, i);
> +			continue;
> +		}
> +
> +		/*
> +		 * Each migratetype corresponds to its own list, make
> +		 * sure the types match otherwise we're moving pages to
> +		 * lists where they do not belong.
> +		 */

Reflow.

> +		migratetype = get_pageblock_migratetype(page_i);
> +		if (get_pageblock_migratetype(page_j) != migratetype) {
> +			pr_debug("%s: migratetype mismatch %#lx\n", __func__, i);
> +			continue;
> +		}
> +
> +		list_swap(&page_i->lru, &page_j->lru);
> +
> +		pr_debug("%s: swap: %#lx -> %#lx\n", __func__, i, j);
> +
> +		/* take it easy on the zone lock */
> +		if ((i % (100 * order_pages)) == 0) {
> +			spin_unlock_irqrestore(&z->lock, flags);
> +			cond_resched();
> +			spin_lock_irqsave(&z->lock, flags);
> +		}
> +	}
> +	spin_unlock_irqrestore(&z->lock, flags);
> +}
> 
> ...
>
Dan Williams Jan. 31, 2019, 11:04 p.m. UTC | #6
On Thu, Jan 31, 2019 at 2:15 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> On Tue, 29 Jan 2019 21:02:16 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
[..]
> > Introduce shuffle_free_memory(), and its helper shuffle_zone(), to
> > perform a Fisher-Yates shuffle of the page allocator 'free_area' lists
> > when they are initially populated with free memory at boot and at
> > hotplug time. Do this based on either the presence of a
> > page_alloc.shuffle=Y command line parameter, or autodetection of a
> > memory-side-cache (to be added in a follow-on patch).
>
> This is unfortunate from a testing and coverage point of view.  At
> least initially it is desirable that all testers run this feature.
>
> Also, it's unfortunate that enabling the feature requires a reboot.
> What happens if we do away with the boot-time (and maybe hotplug-time)
> randomization and permit the feature to be switched on/off at runtime?

Currently there's the 'shuffle' at memory online time and a random
front-back freeing of MAX_ORDER pages to the free lists at runtime.
The random front-back freeing would be trivial to toggle at runtime;
however, testing showed that the entropy it injects is only enough to
preserve the randomization of the initial 'shuffle', not enough to
improve cache utilization on its own.
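
For illustration, that runtime front-back freeing is essentially a
coin flip per high-order free. A minimal sketch of the idea (names and
details may differ from the actual follow-on patch; assumes the usual
mm includes):

	/* Randomly place a newly freed high-order page at the head or
	 * tail of its free list, preserving boot-time shuffle entropy.
	 * A 64-bit pool amortizes the get_random_u64() calls. */
	static void add_to_free_area_random(struct page *page,
			struct free_area *area, int migratetype)
	{
		static u64 rand_pool;
		static u8 rand_bits;

		if (rand_bits == 0) {	/* refill the random-bit pool */
			rand_bits = 64;
			rand_pool = get_random_u64();
		}
		if (rand_pool & 1)
			list_add(&page->lru, &area->free_list[migratetype]);
		else
			list_add_tail(&page->lru, &area->free_list[migratetype]);
		rand_bits--;
		rand_pool >>= 1;
	}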

The shuffling could be done dynamically at runtime, but since it only
shuffles free memory its effectiveness is diminished if the workload
has already taken pages off the free list. It's also diminished if the
free lists are polluted with sub-MAX_ORDER pages.

The number of caveats that need to be documented makes me skeptical
that runtime-triggered shuffling would be reliable.

That said, I see your point about experimentation and validation. What
about making it settable as a sysfs parameter for memory-blocks that
are being hot-added? That way we know the shuffle will be effective and
the administrator can validate shuffling with a hot-unplug/replug?

> > The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
> > pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e.
> > order 10 (4MB); this trades off randomization granularity for time
> > spent shuffling. MAX_ORDER-1 was chosen to be minimally invasive to
> > the page allocator while still showing memory-side cache behavior
> > improvements, with the expectation that the security implications of
> > finer-granularity randomization are mitigated by
> > CONFIG_SLAB_FREELIST_RANDOM.
> >
> > The performance impact of the shuffling appears to be in the noise
> > compared to other memory initialization work. Also the bulk of the work
> > is done in the background as a part of deferred_init_memmap().
> >
> > This initial randomization can be undone over time, so a follow-on patch
> > is introduced to inject entropy on page free decisions. It is reasonable
> > to ask if the page-free entropy alone would be sufficient, but it is not,
> > due to the in-order initial freeing of pages. At the start of that
> > process, putting page1 in front of or behind page0 still keeps them
> > close together; page2 is still near page1 and has a high chance of being
> > adjacent. As more pages are added, ordering diversity improves, but
> > there is still high page locality for the low-address pages, and this
> > leads to no significant impact on the cache conflict rate.
> >
> > ...
> >
> >  include/linux/list.h    |   17 ++++
> >  include/linux/mmzone.h  |    4 +
> >  include/linux/shuffle.h |   45 +++++++++++
> >  init/Kconfig            |   23 ++++++
> >  mm/Makefile             |    7 ++
> >  mm/memblock.c           |    1
> >  mm/memory_hotplug.c     |    3 +
> >  mm/page_alloc.c         |    6 +-
> >  mm/shuffle.c            |  188 +++++++++++++++++++++++++++++++++++++++++++++++
>
> Can we get a Documentation update for the new kernel parameter?

Yes.
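
For example, something along these lines (a sketch, not final wording)
for Documentation/admin-guide/kernel-parameters.txt:

	page_alloc.shuffle=
			[KNL] Boolean flag to control whether the page
			allocator should randomize its free lists. The
			randomization may be automatically enabled if the
			kernel detects it is running on a platform with a
			direct-mapped memory-side cache.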

>
> >
> > ...
> >
> > --- /dev/null
> > +++ b/mm/shuffle.c
> > @@ -0,0 +1,188 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +// Copyright(c) 2018 Intel Corporation. All rights reserved.
> > +
> > +#include <linux/mm.h>
> > +#include <linux/init.h>
> > +#include <linux/mmzone.h>
> > +#include <linux/random.h>
> > +#include <linux/shuffle.h>
>
> Does shuffle.h need to be available to the whole kernel or can we put
> it in mm/?

The wider kernel just needs page_alloc_shuffle() so that
platform-firmware parsing code that detects a memory-side-cache can
enable the shuffle. The rest can be constrained to an mm/ local
header.
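
Concretely, the split would look something like this (a sketch; only
the enable hook stays kernel-wide):

	/* include/linux/shuffle.h -- visible to platform/firmware code */
	enum mm_shuffle_ctl {
		SHUFFLE_ENABLE,
		SHUFFLE_FORCE_DISABLE,
	};
	void page_alloc_shuffle(enum mm_shuffle_ctl ctl);

	/* mm/shuffle.h -- mm-local: SHUFFLE_ORDER, the static key, and
	 * the shuffle_zone()/shuffle_free_memory() wrappers move here. */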

>
> > +#include <linux/moduleparam.h>
> > +#include "internal.h"
> > +
> > +DEFINE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
> > +static unsigned long shuffle_state __ro_after_init;
> > +
> > +/*
> > + * Depending on the architecture, module parameter parsing may run
> > + * before, or after the cache detection. SHUFFLE_FORCE_DISABLE prevents,
> > + * or reverts the enabling of the shuffle implementation. SHUFFLE_ENABLE
> > + * attempts to turn on the implementation, but aborts if it finds
> > + * SHUFFLE_FORCE_DISABLE already set.
> > + */
> > +void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
> > +{
> > +     if (ctl == SHUFFLE_FORCE_DISABLE)
> > +             set_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state);
> > +
> > +     if (test_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state)) {
> > +             if (test_and_clear_bit(SHUFFLE_ENABLE, &shuffle_state))
> > +                     static_branch_disable(&page_alloc_shuffle_key);
> > +     } else if (ctl == SHUFFLE_ENABLE
> > +                     && !test_and_set_bit(SHUFFLE_ENABLE, &shuffle_state))
> > +             static_branch_enable(&page_alloc_shuffle_key);
> > +}
>
> Can this be __meminit?

Yes.
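
I.e., for the next revision:

	-void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
	+void __meminit page_alloc_shuffle(enum mm_shuffle_ctl ctl)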

>
> > +static bool shuffle_param;
> > +static int shuffle_show(char *buffer, const struct kernel_param *kp)
> > +{
> > +     return sprintf(buffer, "%c\n", test_bit(SHUFFLE_ENABLE, &shuffle_state)
> > +                     ? 'Y' : 'N');
> > +}
> > +static int shuffle_store(const char *val, const struct kernel_param *kp)
> > +{
> > +     int rc = param_set_bool(val, kp);
> > +
> > +     if (rc < 0)
> > +             return rc;
> > +     if (shuffle_param)
> > +             page_alloc_shuffle(SHUFFLE_ENABLE);
> > +     else
> > +             page_alloc_shuffle(SHUFFLE_FORCE_DISABLE);
> > +     return 0;
> > +}
> > +module_param_call(shuffle, shuffle_store, shuffle_show, &shuffle_param, 0400);
> >
> > ...
> >
> > +/*
> > + * Fisher-Yates shuffle the freelist which prescribes iterating through
> > + * an array, pfns in this case, and randomly swapping each entry with
> > + * another in the span, end_pfn - start_pfn.
> > + *
> > + * To keep the implementation simple it does not attempt to correct for
> > + * sources of bias in the distribution, like modulo bias or
> > + * pseudo-random number generator bias. I.e. the expectation is that
> > + * this shuffling raises the bar for attacks that exploit the
> > + * predictability of page allocations, but need not be a perfect
> > + * shuffle.
>
> Reflowing the comment to use all 80 cols would save a line :)

Will do.

>
> > + */
> > +#define SHUFFLE_RETRY 10
> > +void __meminit __shuffle_zone(struct zone *z)
> > +{
> > +     unsigned long i, flags;
> > +     unsigned long start_pfn = z->zone_start_pfn;
> > +     unsigned long end_pfn = zone_end_pfn(z);
> > +     const int order = SHUFFLE_ORDER;
> > +     const int order_pages = 1 << order;
> > +
> > +     spin_lock_irqsave(&z->lock, flags);
> > +     start_pfn = ALIGN(start_pfn, order_pages);
> > +     for (i = start_pfn; i < end_pfn; i += order_pages) {
> > +             unsigned long j;
> > +             int migratetype, retry;
> > +             struct page *page_i, *page_j;
> > +
> > +             /*
> > +              * We expect page_i, in the sub-range of a zone being
> > +              * added (@start_pfn to @end_pfn), to more likely be
> > +              * valid compared to page_j randomly selected in the
> > +              * span @zone_start_pfn to @spanned_pages.
> > +              */
> > +             page_i = shuffle_valid_page(i, order);
> > +             if (!page_i)
> > +                     continue;
> > +
> > +             for (retry = 0; retry < SHUFFLE_RETRY; retry++) {
> > +                     /*
> > +                      * Pick a random order aligned page from the
> > +                      * start of the zone. Use the *whole* zone here
> > +                      * so that if it is freed in tiny pieces that we
> > +                      * randomize in the whole zone, not just within
> > +                      * those fragments.
>
> Second sentence is hard to parse.

Earlier versions only arranged to shuffle over non-hole ranges, but
the SHUFFLE_RETRY loop works around that now. I'll update the comment.
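Perhaps something along these lines (a suggested rewording, not the
final text):

	/*
	 * Pick a random order-aligned page from the *whole* zone span.
	 * page_j may land in a hole, so SHUFFLE_RETRY tries a few
	 * candidates; shuffling over the full span (rather than only
	 * the sub-range being added) randomizes across fragments of
	 * previously freed memory, not just within them.
	 */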

>
> > +                      *
> > +                      * Since page_j comes from a potentially sparse
> > +                      * address range we want to try a bit harder to
> > +                      * find a shuffle point for page_i.
> > +                      */
>
> Reflow the comment...

yup.

>
> > +                     j = z->zone_start_pfn +
> > +                             ALIGN_DOWN(get_random_long() % z->spanned_pages,
> > +                                             order_pages);
> > +                     page_j = shuffle_valid_page(j, order);
> > +                     if (page_j && page_j != page_i)
> > +                             break;
> > +             }
> > +             if (retry >= SHUFFLE_RETRY) {
> > +                     pr_debug("%s: failed to swap %#lx\n", __func__, i);
> > +                     continue;
> > +             }
> > +
> > +             /*
> > +              * Each migratetype corresponds to its own list, make
> > +              * sure the types match otherwise we're moving pages to
> > +              * lists where they do not belong.
> > +              */
>
> Reflow.

ok.
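
E.g., reflowed to use the full 80 columns:

	/*
	 * Each migratetype corresponds to its own list; make sure the types
	 * match, otherwise we're moving pages to lists where they do not belong.
	 */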

Patch

diff --git a/include/linux/list.h b/include/linux/list.h
index edb7628e46ed..3dfb8953f241 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -150,6 +150,23 @@  static inline void list_replace_init(struct list_head *old,
 	INIT_LIST_HEAD(old);
 }
 
+/**
+ * list_swap - replace entry1 with entry2 and re-add entry1 at entry2's position
+ * @entry1: the location to place entry2
+ * @entry2: the location to place entry1
+ */
+static inline void list_swap(struct list_head *entry1,
+			     struct list_head *entry2)
+{
+	struct list_head *pos = entry2->prev;
+
+	list_del(entry2);
+	list_replace(entry1, entry2);
+	if (pos == entry1)
+		pos = entry2;
+	list_add(entry1, pos);
+}
+
 /**
  * list_del_init - deletes entry from list and reinitialize it.
  * @entry: the element to delete from the list.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cc4a507d7ca4..374e9d483382 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1272,6 +1272,10 @@  void sparse_init(void);
 #else
 #define sparse_init()	do {} while (0)
 #define sparse_index_init(_sec, _nid)  do {} while (0)
+static inline int pfn_present(unsigned long pfn)
+{
+	return pfn_valid(pfn);
+}
 #endif /* CONFIG_SPARSEMEM */
 
 /*
diff --git a/include/linux/shuffle.h b/include/linux/shuffle.h
new file mode 100644
index 000000000000..bed2d2901d13
--- /dev/null
+++ b/include/linux/shuffle.h
@@ -0,0 +1,45 @@ 
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2018 Intel Corporation. All rights reserved.
+#ifndef _MM_SHUFFLE_H
+#define _MM_SHUFFLE_H
+#include <linux/jump_label.h>
+
+enum mm_shuffle_ctl {
+	SHUFFLE_ENABLE,
+	SHUFFLE_FORCE_DISABLE,
+};
+
+#define SHUFFLE_ORDER (MAX_ORDER-1)
+
+#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
+DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
+extern void page_alloc_shuffle(enum mm_shuffle_ctl ctl);
+extern void __shuffle_free_memory(pg_data_t *pgdat);
+static inline void shuffle_free_memory(pg_data_t *pgdat)
+{
+	if (!static_branch_unlikely(&page_alloc_shuffle_key))
+		return;
+	__shuffle_free_memory(pgdat);
+}
+
+extern void __shuffle_zone(struct zone *z);
+static inline void shuffle_zone(struct zone *z)
+{
+	if (!static_branch_unlikely(&page_alloc_shuffle_key))
+		return;
+	__shuffle_zone(z);
+}
+#else
+static inline void shuffle_free_memory(pg_data_t *pgdat)
+{
+}
+
+static inline void shuffle_zone(struct zone *z)
+{
+}
+
+static inline void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
+{
+}
+#endif
+#endif /* _MM_SHUFFLE_H */
diff --git a/init/Kconfig b/init/Kconfig
index d47cb77a220e..cfa199f3e9be 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1714,6 +1714,29 @@  config SLAB_FREELIST_HARDENED
 	  sacrifies to harden the kernel slab allocator against common
 	  freelist exploit methods.
 
+config SHUFFLE_PAGE_ALLOCATOR
+	bool "Page allocator randomization"
+	default SLAB_FREELIST_RANDOM && ACPI_NUMA
+	help
+	  Randomization of the page allocator improves the average
+	  utilization of a direct-mapped memory-side-cache. See section
+	  5.2.27 Heterogeneous Memory Attribute Table (HMAT) in the ACPI
+	  6.2a specification for an example of how a platform advertises
+	  the presence of a memory-side-cache. There are also incidental
+	  security benefits as it reduces the predictability of page
+	  allocations to complement SLAB_FREELIST_RANDOM, but the
+	  default granularity of shuffling on 4MB (MAX_ORDER-1) pages is
+	  selected based on cache utilization benefits.
+
+	  While the randomization improves cache utilization it may
+	  negatively impact workloads on platforms without a cache. For
+	  this reason, by default, the randomization is enabled only
+	  after runtime detection of a direct-mapped memory-side-cache.
+	  Otherwise, the randomization may be force enabled with the
+	  'page_alloc.shuffle' kernel command line parameter.
+
+	  Say Y if unsure.
+
 config SLUB_CPU_PARTIAL
 	default y
 	depends on SLUB && SMP
diff --git a/mm/Makefile b/mm/Makefile
index d210cc9d6f80..ac5e5ba78874 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,7 +33,7 @@  mmu-$(CONFIG_MMU)	+= process_vm_access.o
 endif
 
 obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
-			   maccess.o page_alloc.o page-writeback.o \
+			   maccess.o page-writeback.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
@@ -41,6 +41,11 @@  obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   interval_tree.o list_lru.o workingset.o \
 			   debug.o $(mmu-y)
 
+# Give 'page_alloc' its own module-parameter namespace
+page-alloc-y := page_alloc.o
+page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
+
+obj-y += page-alloc.o
 obj-y += init-mm.o
 obj-y += memblock.o
 
diff --git a/mm/memblock.c b/mm/memblock.c
index 022d4cbb3618..c0cfbfae4a03 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -17,6 +17,7 @@ 
 #include <linux/poison.h>
 #include <linux/pfn.h>
 #include <linux/debugfs.h>
+#include <linux/shuffle.h>
 #include <linux/kmemleak.h>
 #include <linux/seq_file.h>
 #include <linux/memblock.h>
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b9a667d36c55..07732be3065e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -23,6 +23,7 @@ 
 #include <linux/highmem.h>
 #include <linux/vmalloc.h>
 #include <linux/ioport.h>
+#include <linux/shuffle.h>
 #include <linux/delay.h>
 #include <linux/migrate.h>
 #include <linux/page-isolation.h>
@@ -895,6 +896,8 @@  int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	zone->zone_pgdat->node_present_pages += onlined_pages;
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 
+	shuffle_zone(zone);
+
 	if (onlined_pages) {
 		node_states_set_node(nid, &arg);
 		if (need_zonelists_rebuild)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cde5dac6229a..6208ff744b07 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -61,6 +61,7 @@ 
 #include <linux/sched/rt.h>
 #include <linux/sched/mm.h>
 #include <linux/page_owner.h>
+#include <linux/shuffle.h>
 #include <linux/kthread.h>
 #include <linux/memcontrol.h>
 #include <linux/ftrace.h>
@@ -1752,9 +1753,9 @@  _deferred_grow_zone(struct zone *zone, unsigned int order)
 void __init page_alloc_init_late(void)
 {
 	struct zone *zone;
+	int nid;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-	int nid;
 
 	/* There will be num_node_state(N_MEMORY) threads */
 	atomic_set(&pgdat_init_n_undone, num_node_state(N_MEMORY));
@@ -1779,6 +1780,9 @@  void __init page_alloc_init_late(void)
 	memblock_discard();
 #endif
 
+	for_each_node_state(nid, N_MEMORY)
+		shuffle_free_memory(NODE_DATA(nid));
+
 	for_each_populated_zone(zone)
 		set_zone_contiguous(zone);
 }
diff --git a/mm/shuffle.c b/mm/shuffle.c
new file mode 100644
index 000000000000..db517cdbaebe
--- /dev/null
+++ b/mm/shuffle.c
@@ -0,0 +1,188 @@ 
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2018 Intel Corporation. All rights reserved.
+
+#include <linux/mm.h>
+#include <linux/init.h>
+#include <linux/mmzone.h>
+#include <linux/random.h>
+#include <linux/shuffle.h>
+#include <linux/moduleparam.h>
+#include "internal.h"
+
+DEFINE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
+static unsigned long shuffle_state __ro_after_init;
+
+/*
+ * Depending on the architecture, module parameter parsing may run
+ * before, or after the cache detection. SHUFFLE_FORCE_DISABLE prevents,
+ * or reverts the enabling of the shuffle implementation. SHUFFLE_ENABLE
+ * attempts to turn on the implementation, but aborts if it finds
+ * SHUFFLE_FORCE_DISABLE already set.
+ */
+void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
+{
+	if (ctl == SHUFFLE_FORCE_DISABLE)
+		set_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state);
+
+	if (test_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state)) {
+		if (test_and_clear_bit(SHUFFLE_ENABLE, &shuffle_state))
+			static_branch_disable(&page_alloc_shuffle_key);
+	} else if (ctl == SHUFFLE_ENABLE
+			&& !test_and_set_bit(SHUFFLE_ENABLE, &shuffle_state))
+		static_branch_enable(&page_alloc_shuffle_key);
+}
+
+static bool shuffle_param;
+static int shuffle_show(char *buffer, const struct kernel_param *kp)
+{
+	return sprintf(buffer, "%c\n", test_bit(SHUFFLE_ENABLE, &shuffle_state)
+			? 'Y' : 'N');
+}
+static int shuffle_store(const char *val, const struct kernel_param *kp)
+{
+	int rc = param_set_bool(val, kp);
+
+	if (rc < 0)
+		return rc;
+	if (shuffle_param)
+		page_alloc_shuffle(SHUFFLE_ENABLE);
+	else
+		page_alloc_shuffle(SHUFFLE_FORCE_DISABLE);
+	return 0;
+}
+module_param_call(shuffle, shuffle_store, shuffle_show, &shuffle_param, 0400);
+
+/*
+ * For two pages to be swapped in the shuffle, they must be free (on a
+ * 'free_area' lru), have the same order, and have the same migratetype.
+ */
+static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
+{
+	struct page *page;
+
+	/*
+	 * Given we're dealing with randomly selected pfns in a zone we
+	 * need to ask questions like...
+	 */
+
+	/* ...is the pfn even in the memmap? */
+	if (!pfn_valid_within(pfn))
+		return NULL;
+
+	/* ...is the pfn in a present section or a hole? */
+	if (!pfn_present(pfn))
+		return NULL;
+
+	/* ...is the page free and currently on a free_area list? */
+	page = pfn_to_page(pfn);
+	if (!PageBuddy(page))
+		return NULL;
+
+	/*
+	 * ...is the page on the same list as the page we will
+	 * shuffle it with?
+	 */
+	if (page_order(page) != order)
+		return NULL;
+
+	return page;
+}
+
+/*
+ * Fisher-Yates shuffle the freelist which prescribes iterating through
+ * an array, pfns in this case, and randomly swapping each entry with
+ * another in the span, end_pfn - start_pfn.
+ *
+ * To keep the implementation simple it does not attempt to correct for
+ * sources of bias in the distribution, like modulo bias or
+ * pseudo-random number generator bias. I.e. the expectation is that
+ * this shuffling raises the bar for attacks that exploit the
+ * predictability of page allocations, but need not be a perfect
+ * shuffle.
+ */
+#define SHUFFLE_RETRY 10
+void __meminit __shuffle_zone(struct zone *z)
+{
+	unsigned long i, flags;
+	unsigned long start_pfn = z->zone_start_pfn;
+	unsigned long end_pfn = zone_end_pfn(z);
+	const int order = SHUFFLE_ORDER;
+	const int order_pages = 1 << order;
+
+	spin_lock_irqsave(&z->lock, flags);
+	start_pfn = ALIGN(start_pfn, order_pages);
+	for (i = start_pfn; i < end_pfn; i += order_pages) {
+		unsigned long j;
+		int migratetype, retry;
+		struct page *page_i, *page_j;
+
+		/*
+		 * We expect page_i, in the sub-range of a zone being
+		 * added (@start_pfn to @end_pfn), to more likely be
+		 * valid compared to page_j randomly selected in the
+		 * span @zone_start_pfn to @spanned_pages.
+		 */
+		page_i = shuffle_valid_page(i, order);
+		if (!page_i)
+			continue;
+
+		for (retry = 0; retry < SHUFFLE_RETRY; retry++) {
+			/*
+			 * Pick a random order aligned page from the
+			 * start of the zone. Use the *whole* zone here
+			 * so that if it is freed in tiny pieces that we
+			 * randomize in the whole zone, not just within
+			 * those fragments.
+			 *
+			 * Since page_j comes from a potentially sparse
+			 * address range we want to try a bit harder to
+			 * find a shuffle point for page_i.
+			 */
+			j = z->zone_start_pfn +
+				ALIGN_DOWN(get_random_long() % z->spanned_pages,
+						order_pages);
+			page_j = shuffle_valid_page(j, order);
+			if (page_j && page_j != page_i)
+				break;
+		}
+		if (retry >= SHUFFLE_RETRY) {
+			pr_debug("%s: failed to swap %#lx\n", __func__, i);
+			continue;
+		}
+
+		/*
+		 * Each migratetype corresponds to its own list, make
+		 * sure the types match otherwise we're moving pages to
+		 * lists where they do not belong.
+		 */
+		migratetype = get_pageblock_migratetype(page_i);
+		if (get_pageblock_migratetype(page_j) != migratetype) {
+			pr_debug("%s: migratetype mismatch %#lx\n", __func__, i);
+			continue;
+		}
+
+		list_swap(&page_i->lru, &page_j->lru);
+
+		pr_debug("%s: swap: %#lx -> %#lx\n", __func__, i, j);
+
+		/* take it easy on the zone lock */
+		if ((i % (100 * order_pages)) == 0) {
+			spin_unlock_irqrestore(&z->lock, flags);
+			cond_resched();
+			spin_lock_irqsave(&z->lock, flags);
+		}
+	}
+	spin_unlock_irqrestore(&z->lock, flags);
+}
+
+/**
+ * shuffle_free_memory - reduce the predictability of the page allocator
+ * @pgdat: node page data
+ */
+void __meminit __shuffle_free_memory(pg_data_t *pgdat)
+{
+	struct zone *z;
+
+	for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
+		shuffle_zone(z);
+}