[v2,1/4] mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options
diff mbox series

Message ID 20190514143537.10435-2-glider@google.com
State New
Headers show
Series
  • RFC: add init_on_alloc/init_on_free boot options
Related show

Commit Message

Alexander Potapenko May 14, 2019, 2:35 p.m. UTC
The new options are needed to prevent possible information leaks and
make control-flow bugs that depend on uninitialized values more
deterministic.

init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
objects with zeroes. Initialization is done at allocation time at the
places where checks for __GFP_ZERO are performed.

init_on_free=1 makes the kernel initialize freed pages and heap objects
with zeroes upon their deletion. This helps to ensure sensitive data
doesn't leak via use-after-free accesses.

Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
returns zeroed memory. The only exception is slab caches with
constructors. Those are never zero-initialized to preserve their semantics.

For SLOB allocator init_on_free=1 also implies init_on_alloc=1 behavior,
i.e. objects are zeroed at both allocation and deallocation time.
This is done because SLOB may otherwise return multiple freelist pointers
in the allocated object. For SLAB and SLUB enabling either init_on_alloc
or init_on_free leads to one-time initialization of the object.

Both init_on_alloc and init_on_free default to zero, but those defaults
can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
CONFIG_INIT_ON_FREE_DEFAULT_ON.

Slowdown for the new features compared to init_on_free=0,
init_on_alloc=0:

hackbench, init_on_free=1:  +7.62% sys time (st.err 0.74%)
hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)

Linux build with -j12, init_on_free=1:  +8.38% wall time (st.err 0.39%)
Linux build with -j12, init_on_free=1:  +24.42% sys time (st.err 0.52%)
Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)

The slowdown for init_on_free=0, init_on_alloc=0 compared to the
baseline is within the standard error.

Signed-off-by: Alexander Potapenko <glider@google.com>
To: Andrew Morton <akpm@linux-foundation.org>
To: Christoph Lameter <cl@linux.com>
To: Kees Cook <keescook@chromium.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: James Morris <jmorris@namei.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Sandeep Patil <sspatil@android.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Jann Horn <jannh@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-mm@kvack.org
Cc: linux-security-module@vger.kernel.org
Cc: kernel-hardening@lists.openwall.com
---
 v2:
  - unconditionally initialize pages in kernel_init_free_pages()
  - comment from Randy Dunlap: drop 'default false' lines from Kconfig.hardening
---
 .../admin-guide/kernel-parameters.txt         |  8 +++
 drivers/infiniband/core/uverbs_ioctl.c        |  2 +-
 include/linux/mm.h                            | 22 ++++++
 kernel/kexec_core.c                           |  2 +-
 mm/dmapool.c                                  |  2 +-
 mm/page_alloc.c                               | 68 ++++++++++++++++---
 mm/slab.c                                     | 16 ++++-
 mm/slab.h                                     | 16 +++++
 mm/slob.c                                     | 22 +++++-
 mm/slub.c                                     | 27 ++++++--
 net/core/sock.c                               |  2 +-
 security/Kconfig.hardening                    | 14 ++++
 12 files changed, 178 insertions(+), 23 deletions(-)

Comments

Kees Cook May 16, 2019, 4:19 p.m. UTC | #1
On Tue, May 14, 2019 at 04:35:34PM +0200, Alexander Potapenko wrote:
> Slowdown for the new features compared to init_on_free=0,
> init_on_alloc=0:
> 
> hackbench, init_on_free=1:  +7.62% sys time (st.err 0.74%)
> hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)

I wonder if the patch series should be reorganized to introduce
__GFP_NO_AUTOINIT first, so that when the commit with benchmarks appears,
we get the "final" numbers...

> Linux build with -j12, init_on_free=1:  +8.38% wall time (st.err 0.39%)
> Linux build with -j12, init_on_free=1:  +24.42% sys time (st.err 0.52%)
> Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
> Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)

I'm working on reproducing these benchmarks. I'd really like to narrow
down the +24% number here. But it does 

> The slowdown for init_on_free=0, init_on_alloc=0 compared to the
> baseline is within the standard error.

I think the use of static keys here is really great: this is available
by default for anyone that wants to turn it on.

I'm thinking, given the configuable nature of this, it'd be worth adding
a little more detail at boot time. I think maybe a separate patch could
be added to describe the kernel's memory auto-initialization features,
and add something like this to mm_init():

+void __init report_meminit(void)
+{
+	const char *stack;
+
+	if (IS_ENABLED(CONFIG_INIT_STACK_ALL))
+		stack = "all";
+	else if (IS_ENABLED(CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL))
+		stack = "byref_all";
+	else if (IS_ENABLED(CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF))
+		stack = "byref";
+	else if (IS_ENABLED(CONFIG_GCC_PLUGIN_STRUCTLEAK_USER))
+		stack = "__user";
+	else
+		stack = "off";
+
+	/* Report memory auto-initialization states for this boot. */
+	pr_info("mem auto-init: stack:%s, heap alloc:%s, heap free:%s\n",
+		stack, want_init_on_alloc(GFP_KERNEL) ? "on" : "off",
+		want_init_on_free() ? "on" : "off");
+}

To get a boot line like:

	mem auto-init: stack:off, heap alloc:off, heap free:on

And one other thought I had was that in the init_on_free=1 case, there is
a large pause at boot while memory is being cleared. I think it'd be handy
to include a comment about that, just to keep people from being surprised:

diff --git a/init/main.c b/init/main.c
index cf0c3948ce0e..aea278392338 100644
--- a/init/main.c
+++ b/init/main.c
@@ -529,6 +529,8 @@ static void __init mm_init(void)
 	 * bigger than MAX_ORDER unless SPARSEMEM.
 	 */
 	page_ext_init_flatmem();
+	if (want_init_on_free())
+		pr_info("Clearing system memory ...\n");
 	mem_init();
 	kmem_cache_init();
 	pgtable_init();

Beyond these thoughts, I think this series is in good shape.

Andrew (or anyone else) do you have any concerns about this?
Alexander Potapenko May 16, 2019, 4:42 p.m. UTC | #2
From: Kees Cook <keescook@chromium.org>
Date: Thu, May 16, 2019 at 6:20 PM
To: Alexander Potapenko
Cc: <akpm@linux-foundation.org>, <cl@linux.com>,
<kernel-hardening@lists.openwall.com>, Masahiro Yamada, James Morris,
Serge E. Hallyn, Nick Desaulniers, Kostya Serebryany, Dmitry Vyukov,
Sandeep Patil, Laura Abbott, Randy Dunlap, Jann Horn, Mark Rutland,
<linux-mm@kvack.org>, <linux-security-module@vger.kernel.org>

> On Tue, May 14, 2019 at 04:35:34PM +0200, Alexander Potapenko wrote:
> > Slowdown for the new features compared to init_on_free=0,
> > init_on_alloc=0:
> >
> > hackbench, init_on_free=1:  +7.62% sys time (st.err 0.74%)
> > hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)
>
> I wonder if the patch series should be reorganized to introduce
> __GFP_NO_AUTOINIT first, so that when the commit with benchmarks appears,
> we get the "final" numbers...
>
> > Linux build with -j12, init_on_free=1:  +8.38% wall time (st.err 0.39%)
> > Linux build with -j12, init_on_free=1:  +24.42% sys time (st.err 0.52%)
> > Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
> > Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)
>
> I'm working on reproducing these benchmarks. I'd really like to narrow
> down the +24% number here. But it does
I suspect the slowdown of init_on_free is bigger than that of
PAX_SANITIZE_MEMORY, as we've set the goal to have fully zeroed memory
at alloc time.
If we want a mode that only wipes the user data upon free() but
doesn't eliminate all uninit memory, then we can make it faster.
> > The slowdown for init_on_free=0, init_on_alloc=0 compared to the
> > baseline is within the standard error.
>
> I think the use of static keys here is really great: this is available
> by default for anyone that wants to turn it on.
>
> I'm thinking, given the configuable nature of this, it'd be worth adding
> a little more detail at boot time. I think maybe a separate patch could
> be added to describe the kernel's memory auto-initialization features,
> and add something like this to mm_init():
>
> +void __init report_meminit(void)
> +{
> +       const char *stack;
> +
> +       if (IS_ENABLED(CONFIG_INIT_STACK_ALL))
> +               stack = "all";
> +       else if (IS_ENABLED(CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL))
> +               stack = "byref_all";
> +       else if (IS_ENABLED(CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF))
> +               stack = "byref";
> +       else if (IS_ENABLED(CONFIG_GCC_PLUGIN_STRUCTLEAK_USER))
> +               stack = "__user";
> +       else
> +               stack = "off";
> +
> +       /* Report memory auto-initialization states for this boot. */
> +       pr_info("mem auto-init: stack:%s, heap alloc:%s, heap free:%s\n",
> +               stack, want_init_on_alloc(GFP_KERNEL) ? "on" : "off",
> +               want_init_on_free() ? "on" : "off");
> +}
>
> To get a boot line like:
>
>         mem auto-init: stack:off, heap alloc:off, heap free:on
For stack there's no binary on/off, as you can potentially build half
of the kernel with stack instrumentation and another half without it.
We could make the instrumentation insert a static global flag into
each translation unit, but this won't give us any interesting info.

> And one other thought I had was that in the init_on_free=1 case, there is
> a large pause at boot while memory is being cleared. I think it'd be handy
> to include a comment about that, just to keep people from being surprised:
>
> diff --git a/init/main.c b/init/main.c
> index cf0c3948ce0e..aea278392338 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -529,6 +529,8 @@ static void __init mm_init(void)
>          * bigger than MAX_ORDER unless SPARSEMEM.
>          */
>         page_ext_init_flatmem();
> +       if (want_init_on_free())
> +               pr_info("Clearing system memory ...\n");
>         mem_init();
>         kmem_cache_init();
>         pgtable_init();
>
> Beyond these thoughts, I think this series is in good shape.
>
> Andrew (or anyone else) do you have any concerns about this?
>
> --
> Kees Cook
Kees Cook May 16, 2019, 5:03 p.m. UTC | #3
On Thu, May 16, 2019 at 06:42:37PM +0200, Alexander Potapenko wrote:
> I suspect the slowdown of init_on_free is bigger than that of
> PAX_SANITIZE_MEMORY, as we've set the goal to have fully zeroed memory
> at alloc time.
> If we want a mode that only wipes the user data upon free() but
> doesn't eliminate all uninit memory, then we can make it faster.

Yeah, I sent a separate email that discusses this a bit more.

I think the goals of init_on_alloc and init_on_free are likely a bit
different. Given init_on_alloc's much more cache-friendly performance,
I think that it's likely the right way forward for getting to fully zeroed
memory at alloc time. (Though I note that it already includes exclusions:
such tradeoffs won't be unique to init_on_free.)

init_on_free appears to give us similar coverage (but also reduces the
lifetime of unused data), but isn't cache-friendly so it looks to need
a lot more tuning/trade-offs. (Not that we shouldn't include it! It'll
just need a bit more care to be reasonable.)

> > +void __init report_meminit(void)
> > +{
> > +       const char *stack;
> > +
> > +       if (IS_ENABLED(CONFIG_INIT_STACK_ALL))
> > +               stack = "all";
> > +       else if (IS_ENABLED(CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL))
> > +               stack = "byref_all";
> > +       else if (IS_ENABLED(CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF))
> > +               stack = "byref";
> > +       else if (IS_ENABLED(CONFIG_GCC_PLUGIN_STRUCTLEAK_USER))
> > +               stack = "__user";
> > +       else
> > +               stack = "off";
> > +
> > +       /* Report memory auto-initialization states for this boot. */
> > +       pr_info("mem auto-init: stack:%s, heap alloc:%s, heap free:%s\n",
> > +               stack, want_init_on_alloc(GFP_KERNEL) ? "on" : "off",
> > +               want_init_on_free() ? "on" : "off");
> > +}
> >
> > To get a boot line like:
> >
> >         mem auto-init: stack:off, heap alloc:off, heap free:on
> For stack there's no binary on/off, as you can potentially build half
> of the kernel with stack instrumentation and another half without it.
> We could make the instrumentation insert a static global flag into
> each translation unit, but this won't give us any interesting info.

Well, yes, that's technically true, but I think reporting the kernel
config here would make sense. If someone intentionally bypasses the
stack auto-init for portions of the kernel, we can't meaningfully report
it here. There will be exceptions for stack auto-init and heap auto-init.
Kees Cook May 17, 2019, 1:26 a.m. UTC | #4
On Tue, May 14, 2019 at 04:35:34PM +0200, Alexander Potapenko wrote:
> [...]
> diff --git a/mm/slab.h b/mm/slab.h
> index 43ac818b8592..24ae887359b8 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -524,4 +524,20 @@ static inline int cache_random_seq_create(struct kmem_cache *cachep,
> [...]
> +static inline bool slab_want_init_on_free(struct kmem_cache *c)
> +{
> +	if (static_branch_unlikely(&init_on_free))
> +		return !(c->ctor);

BTW, why is this checking for c->ctor here? Shouldn't it not matter for
the free case?

> +	else
> +		return false;
> +}
Michal Hocko May 17, 2019, 2:04 p.m. UTC | #5
On Tue 14-05-19 16:35:34, Alexander Potapenko wrote:
> The new options are needed to prevent possible information leaks and
> make control-flow bugs that depend on uninitialized values more
> deterministic.
> 
> init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
> objects with zeroes. Initialization is done at allocation time at the
> places where checks for __GFP_ZERO are performed.
> 
> init_on_free=1 makes the kernel initialize freed pages and heap objects
> with zeroes upon their deletion. This helps to ensure sensitive data
> doesn't leak via use-after-free accesses.

Why do we need both? The later is more robust because even free memory
cannot be sniffed and the overhead might be shifted from the allocation
context (e.g. to RCU) but why cannot we stick to a single model?
Alexander Potapenko May 17, 2019, 2:11 p.m. UTC | #6
On Fri, May 17, 2019 at 4:04 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 14-05-19 16:35:34, Alexander Potapenko wrote:
> > The new options are needed to prevent possible information leaks and
> > make control-flow bugs that depend on uninitialized values more
> > deterministic.
> >
> > init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
> > objects with zeroes. Initialization is done at allocation time at the
> > places where checks for __GFP_ZERO are performed.
> >
> > init_on_free=1 makes the kernel initialize freed pages and heap objects
> > with zeroes upon their deletion. This helps to ensure sensitive data
> > doesn't leak via use-after-free accesses.
>
> Why do we need both? The later is more robust because even free memory
> cannot be sniffed and the overhead might be shifted from the allocation
> context (e.g. to RCU) but why cannot we stick to a single model?
init_on_free appears to be slower because of cache effects. It's
several % in the best case vs. <1% for init_on_alloc.

> --
> Michal Hocko
> SUSE Labs
Michal Hocko May 17, 2019, 2:20 p.m. UTC | #7
On Fri 17-05-19 16:11:32, Alexander Potapenko wrote:
> On Fri, May 17, 2019 at 4:04 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Tue 14-05-19 16:35:34, Alexander Potapenko wrote:
> > > The new options are needed to prevent possible information leaks and
> > > make control-flow bugs that depend on uninitialized values more
> > > deterministic.
> > >
> > > init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
> > > objects with zeroes. Initialization is done at allocation time at the
> > > places where checks for __GFP_ZERO are performed.
> > >
> > > init_on_free=1 makes the kernel initialize freed pages and heap objects
> > > with zeroes upon their deletion. This helps to ensure sensitive data
> > > doesn't leak via use-after-free accesses.
> >
> > Why do we need both? The later is more robust because even free memory
> > cannot be sniffed and the overhead might be shifted from the allocation
> > context (e.g. to RCU) but why cannot we stick to a single model?
> init_on_free appears to be slower because of cache effects. It's
> several % in the best case vs. <1% for init_on_alloc.

This doesn't really explain why we need both.
Alexander Potapenko May 17, 2019, 2:38 p.m. UTC | #8
On Fri, May 17, 2019 at 3:26 AM Kees Cook <keescook@chromium.org> wrote:
>
> On Tue, May 14, 2019 at 04:35:34PM +0200, Alexander Potapenko wrote:
> > [...]
> > diff --git a/mm/slab.h b/mm/slab.h
> > index 43ac818b8592..24ae887359b8 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -524,4 +524,20 @@ static inline int cache_random_seq_create(struct kmem_cache *cachep,
> > [...]
> > +static inline bool slab_want_init_on_free(struct kmem_cache *c)
> > +{
> > +     if (static_branch_unlikely(&init_on_free))
> > +             return !(c->ctor);
>
> BTW, why is this checking for c->ctor here? Shouldn't it not matter for
> the free case?
It does matter, see e.g. the handling of __OBJECT_POISON in slub.c
If we just return true here, the kernel crashes.
> > +     else
> > +             return false;
> > +}
>
> --
> Kees Cook
Kees Cook May 17, 2019, 4:36 p.m. UTC | #9
On Fri, May 17, 2019 at 04:20:48PM +0200, Michal Hocko wrote:
> On Fri 17-05-19 16:11:32, Alexander Potapenko wrote:
> > On Fri, May 17, 2019 at 4:04 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Tue 14-05-19 16:35:34, Alexander Potapenko wrote:
> > > > The new options are needed to prevent possible information leaks and
> > > > make control-flow bugs that depend on uninitialized values more
> > > > deterministic.
> > > >
> > > > init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
> > > > objects with zeroes. Initialization is done at allocation time at the
> > > > places where checks for __GFP_ZERO are performed.
> > > >
> > > > init_on_free=1 makes the kernel initialize freed pages and heap objects
> > > > with zeroes upon their deletion. This helps to ensure sensitive data
> > > > doesn't leak via use-after-free accesses.
> > >
> > > Why do we need both? The later is more robust because even free memory
> > > cannot be sniffed and the overhead might be shifted from the allocation
> > > context (e.g. to RCU) but why cannot we stick to a single model?
> > init_on_free appears to be slower because of cache effects. It's
> > several % in the best case vs. <1% for init_on_alloc.
> 
> This doesn't really explain why we need both.

There are a couple reasons. The first is that once we have hardware with
memory tagging (e.g. arm64's MTE) we'll need both on_alloc and on_free
hooks to do change the tags. With MTE, zeroing comes for "free" with
tagging (though tagging is as slow as zeroing, so it's really the tagging
that is free...), so we'll need to re-use the init_on_free infrastructure.

The second reason is for very paranoid use-cases where in-memory
data lifetime is desired to be minimized. There are various arguments
for/against the realism of the associated threat models, but given that
we'll need the infrastructre for MTE anyway, and there are people who
want wipe-on-free behavior no matter what the performance cost, it seems
reasonable to include it in this series.

All that said, init_on_alloc looks desirable enough that distros will
likely build with it enabled by default (I hope), and the very paranoid
users will switch to (or additionally enable) init_on_free for their
systems.
Michal Hocko May 17, 2019, 5:11 p.m. UTC | #10
On Fri 17-05-19 09:36:36, Kees Cook wrote:
> On Fri, May 17, 2019 at 04:20:48PM +0200, Michal Hocko wrote:
> > On Fri 17-05-19 16:11:32, Alexander Potapenko wrote:
> > > On Fri, May 17, 2019 at 4:04 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Tue 14-05-19 16:35:34, Alexander Potapenko wrote:
> > > > > The new options are needed to prevent possible information leaks and
> > > > > make control-flow bugs that depend on uninitialized values more
> > > > > deterministic.
> > > > >
> > > > > init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
> > > > > objects with zeroes. Initialization is done at allocation time at the
> > > > > places where checks for __GFP_ZERO are performed.
> > > > >
> > > > > init_on_free=1 makes the kernel initialize freed pages and heap objects
> > > > > with zeroes upon their deletion. This helps to ensure sensitive data
> > > > > doesn't leak via use-after-free accesses.
> > > >
> > > > Why do we need both? The later is more robust because even free memory
> > > > cannot be sniffed and the overhead might be shifted from the allocation
> > > > context (e.g. to RCU) but why cannot we stick to a single model?
> > > init_on_free appears to be slower because of cache effects. It's
> > > several % in the best case vs. <1% for init_on_alloc.
> > 
> > This doesn't really explain why we need both.
> 
> There are a couple reasons. The first is that once we have hardware with
> memory tagging (e.g. arm64's MTE) we'll need both on_alloc and on_free
> hooks to do change the tags. With MTE, zeroing comes for "free" with
> tagging (though tagging is as slow as zeroing, so it's really the tagging
> that is free...), so we'll need to re-use the init_on_free infrastructure.

I am not sure I follow, but ...
> 
> The second reason is for very paranoid use-cases where in-memory
> data lifetime is desired to be minimized. There are various arguments
> for/against the realism of the associated threat models, but given that
> we'll need the infrastructre for MTE anyway, and there are people who
> want wipe-on-free behavior no matter what the performance cost, it seems
> reasonable to include it in this series.
> 
> All that said, init_on_alloc looks desirable enough that distros will
> likely build with it enabled by default (I hope), and the very paranoid
> users will switch to (or additionally enable) init_on_free for their
> systems.

... this should all be part of the changelog.

Patch
diff mbox series

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 08df58805703..cece9a56ddb1 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1673,6 +1673,14 @@ 
 
 	initrd=		[BOOT] Specify the location of the initial ramdisk
 
+	init_on_alloc=	[MM] Fill newly allocated pages and heap objects with
+			zeroes.
+			Format: 0 | 1
+			Default set by CONFIG_INIT_ON_ALLOC_DEFAULT_ON.
+	init_on_free=	[MM] Fill freed pages and heap objects with zeroes.
+			Format: 0 | 1
+			Default set by CONFIG_INIT_ON_FREE_DEFAULT_ON.
+
 	init_pkru=	[x86] Specify the default memory protection keys rights
 			register contents for all processes.  0x55555554 by
 			default (disallow access to all but pkey 0).  Can
diff --git a/drivers/infiniband/core/uverbs_ioctl.c b/drivers/infiniband/core/uverbs_ioctl.c
index 829b0c6944d8..61758201d9b2 100644
--- a/drivers/infiniband/core/uverbs_ioctl.c
+++ b/drivers/infiniband/core/uverbs_ioctl.c
@@ -127,7 +127,7 @@  __malloc void *_uverbs_alloc(struct uverbs_attr_bundle *bundle, size_t size,
 	res = (void *)pbundle->internal_buffer + pbundle->internal_used;
 	pbundle->internal_used =
 		ALIGN(new_used, sizeof(*pbundle->internal_buffer));
-	if (flags & __GFP_ZERO)
+	if (want_init_on_alloc(flags))
 		memset(res, 0, size);
 	return res;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 083d7b4863ed..18d96f1d07c5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2610,6 +2610,28 @@  static inline void kernel_poison_pages(struct page *page, int numpages,
 					int enable) { }
 #endif
 
+#ifdef CONFIG_INIT_ON_ALLOC_DEFAULT_ON
+DECLARE_STATIC_KEY_TRUE(init_on_alloc);
+#else
+DECLARE_STATIC_KEY_FALSE(init_on_alloc);
+#endif
+static inline bool want_init_on_alloc(gfp_t flags)
+{
+	if (static_branch_unlikely(&init_on_alloc))
+		return true;
+	return flags & __GFP_ZERO;
+}
+
+#ifdef CONFIG_INIT_ON_FREE_DEFAULT_ON
+DECLARE_STATIC_KEY_TRUE(init_on_free);
+#else
+DECLARE_STATIC_KEY_FALSE(init_on_free);
+#endif
+static inline bool want_init_on_free(void)
+{
+	return static_branch_unlikely(&init_on_free);
+}
+
 extern bool _debug_pagealloc_enabled;
 
 static inline bool debug_pagealloc_enabled(void)
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index fd5c95ff9251..2f75dd0d0d81 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -315,7 +315,7 @@  static struct page *kimage_alloc_pages(gfp_t gfp_mask, unsigned int order)
 		arch_kexec_post_alloc_pages(page_address(pages), count,
 					    gfp_mask);
 
-		if (gfp_mask & __GFP_ZERO)
+		if (want_init_on_alloc(gfp_mask))
 			for (i = 0; i < count; i++)
 				clear_highpage(pages + i);
 	}
diff --git a/mm/dmapool.c b/mm/dmapool.c
index 76a160083506..493d151067cb 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -381,7 +381,7 @@  void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
 #endif
 	spin_unlock_irqrestore(&pool->lock, flags);
 
-	if (mem_flags & __GFP_ZERO)
+	if (want_init_on_alloc(mem_flags))
 		memset(retval, 0, pool->size);
 
 	return retval;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 59661106da16..463c681a3633 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -133,6 +133,48 @@  unsigned long totalcma_pages __read_mostly;
 
 int percpu_pagelist_fraction;
 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
+#ifdef CONFIG_INIT_ON_ALLOC_DEFAULT_ON
+DEFINE_STATIC_KEY_TRUE(init_on_alloc);
+#else
+DEFINE_STATIC_KEY_FALSE(init_on_alloc);
+#endif
+#ifdef CONFIG_INIT_ON_FREE_DEFAULT_ON
+DEFINE_STATIC_KEY_TRUE(init_on_free);
+#else
+DEFINE_STATIC_KEY_FALSE(init_on_free);
+#endif
+
+static int __init early_init_on_alloc(char *buf)
+{
+	int ret;
+	bool bool_result;
+
+	if (!buf)
+		return -EINVAL;
+	ret = kstrtobool(buf, &bool_result);
+	if (bool_result)
+		static_branch_enable(&init_on_alloc);
+	else
+		static_branch_disable(&init_on_alloc);
+	return ret;
+}
+early_param("init_on_alloc", early_init_on_alloc);
+
+static int __init early_init_on_free(char *buf)
+{
+	int ret;
+	bool bool_result;
+
+	if (!buf)
+		return -EINVAL;
+	ret = kstrtobool(buf, &bool_result);
+	if (bool_result)
+		static_branch_enable(&init_on_free);
+	else
+		static_branch_disable(&init_on_free);
+	return ret;
+}
+early_param("init_on_free", early_init_on_free);
 
 /*
  * A cached value of the page's pageblock's migratetype, used when the page is
@@ -1092,6 +1134,14 @@  static int free_tail_pages_check(struct page *head_page, struct page *page)
 	return ret;
 }
 
+static void kernel_init_free_pages(struct page *page, int numpages)
+{
+	int i;
+
+	for (i = 0; i < numpages; i++)
+		clear_highpage(page + i);
+}
+
 static __always_inline bool free_pages_prepare(struct page *page,
 					unsigned int order, bool check_free)
 {
@@ -1144,9 +1194,10 @@  static __always_inline bool free_pages_prepare(struct page *page,
 	}
 	arch_free_page(page, order);
 	kernel_poison_pages(page, 1 << order, 0);
+	if (want_init_on_free())
+		kernel_init_free_pages(page, 1 << order);
 	if (debug_pagealloc_enabled())
 		kernel_map_pages(page, 1 << order, 0);
-
 	kasan_free_nondeferred_pages(page, order);
 
 	return true;
@@ -1452,8 +1503,10 @@  meminit_pfn_in_nid(unsigned long pfn, int node,
 void __init memblock_free_pages(struct page *page, unsigned long pfn,
 							unsigned int order)
 {
-	if (early_page_uninitialised(pfn))
+	if (early_page_uninitialised(pfn)) {
+		kernel_init_free_pages(page, 1 << order);
 		return;
+	}
 	__free_pages_core(page, order);
 }
 
@@ -1971,8 +2024,8 @@  static inline int check_new_page(struct page *page)
 
 static inline bool free_pages_prezeroed(void)
 {
-	return IS_ENABLED(CONFIG_PAGE_POISONING_ZERO) &&
-		page_poisoning_enabled();
+	return (IS_ENABLED(CONFIG_PAGE_POISONING_ZERO) &&
+		page_poisoning_enabled()) || want_init_on_free();
 }
 
 #ifdef CONFIG_DEBUG_VM
@@ -2026,13 +2079,10 @@  inline void post_alloc_hook(struct page *page, unsigned int order,
 static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
 							unsigned int alloc_flags)
 {
-	int i;
-
 	post_alloc_hook(page, order, gfp_flags);
 
-	if (!free_pages_prezeroed() && (gfp_flags & __GFP_ZERO))
-		for (i = 0; i < (1 << order); i++)
-			clear_highpage(page + i);
+	if (!free_pages_prezeroed() && want_init_on_alloc(gfp_flags))
+		kernel_init_free_pages(page, 1 << order);
 
 	if (order && (gfp_flags & __GFP_COMP))
 		prep_compound_page(page, order);
diff --git a/mm/slab.c b/mm/slab.c
index 284ab737faee..d00e9de26a45 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1855,6 +1855,14 @@  static bool set_objfreelist_slab_cache(struct kmem_cache *cachep,
 
 	cachep->num = 0;
 
+	/*
+	 * If slab auto-initialization on free is enabled, store the freelist
+	 * off-slab, so that its contents don't end up in one of the allocated
+	 * objects.
+	 */
+	if (unlikely(slab_want_init_on_free(cachep)))
+		return false;
+
 	if (cachep->ctor || flags & SLAB_TYPESAFE_BY_RCU)
 		return false;
 
@@ -3294,7 +3302,7 @@  slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 	local_irq_restore(save_flags);
 	ptr = cache_alloc_debugcheck_after(cachep, flags, ptr, caller);
 
-	if (unlikely(flags & __GFP_ZERO) && ptr)
+	if (unlikely(slab_want_init_on_alloc(flags, cachep)) && ptr)
 		memset(ptr, 0, cachep->object_size);
 
 	slab_post_alloc_hook(cachep, flags, 1, &ptr);
@@ -3351,7 +3359,7 @@  slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
 	objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller);
 	prefetchw(objp);
 
-	if (unlikely(flags & __GFP_ZERO) && objp)
+	if (unlikely(slab_want_init_on_alloc(flags, cachep)) && objp)
 		memset(objp, 0, cachep->object_size);
 
 	slab_post_alloc_hook(cachep, flags, 1, &objp);
@@ -3472,6 +3480,8 @@  void ___cache_free(struct kmem_cache *cachep, void *objp,
 	struct array_cache *ac = cpu_cache_get(cachep);
 
 	check_irq_off();
+	if (unlikely(slab_want_init_on_free(cachep)))
+		memset(objp, 0, cachep->object_size);
 	kmemleak_free_recursive(objp, cachep->flags);
 	objp = cache_free_debugcheck(cachep, objp, caller);
 
@@ -3559,7 +3569,7 @@  int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 	cache_alloc_debugcheck_after_bulk(s, flags, size, p, _RET_IP_);
 
 	/* Clear memory outside IRQ disabled section */
-	if (unlikely(flags & __GFP_ZERO))
+	if (unlikely(slab_want_init_on_alloc(flags, s)))
 		for (i = 0; i < size; i++)
 			memset(p[i], 0, s->object_size);
 
diff --git a/mm/slab.h b/mm/slab.h
index 43ac818b8592..24ae887359b8 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -524,4 +524,20 @@  static inline int cache_random_seq_create(struct kmem_cache *cachep,
 static inline void cache_random_seq_destroy(struct kmem_cache *cachep) { }
 #endif /* CONFIG_SLAB_FREELIST_RANDOM */
 
+static inline bool slab_want_init_on_alloc(gfp_t flags, struct kmem_cache *c)
+{
+	if (static_branch_unlikely(&init_on_alloc))
+		return !(c->ctor);
+	else
+		return flags & __GFP_ZERO;
+}
+
+static inline bool slab_want_init_on_free(struct kmem_cache *c)
+{
+	if (static_branch_unlikely(&init_on_free))
+		return !(c->ctor);
+	else
+		return false;
+}
+
 #endif /* MM_SLAB_H */
diff --git a/mm/slob.c b/mm/slob.c
index 307c2c9feb44..351d3dfee000 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -212,6 +212,19 @@  static void slob_free_pages(void *b, int order)
 	free_pages((unsigned long)b, order);
 }
 
+/*
+ * init_on_free=1 also implies initialization at allocation time.
+ * This is because newly allocated objects may contain freelist pointers
+ * somewhere in the middle.
+ */
+static inline bool slob_want_init_on_alloc(gfp_t flags, struct kmem_cache *c)
+{
+	if (static_branch_unlikely(&init_on_alloc) ||
+	    static_branch_unlikely(&init_on_free))
+		return c ? (!c->ctor) : true;
+	return flags & __GFP_ZERO;
+}
+
 /*
  * Allocate a slob block within a given slob_page sp.
  */
@@ -330,8 +343,6 @@  static void *slob_alloc(size_t size, gfp_t gfp, int align, int node)
 		BUG_ON(!b);
 		spin_unlock_irqrestore(&slob_lock, flags);
 	}
-	if (unlikely(gfp & __GFP_ZERO))
-		memset(b, 0, size);
 	return b;
 }
 
@@ -366,6 +377,9 @@  static void slob_free(void *block, int size)
 		return;
 	}
 
+	if (unlikely(want_init_on_free()))
+		memset(block, 0, size);
+
 	if (!slob_page_free(sp)) {
 		/* This slob page is about to become partially free. Easy! */
 		sp->units = units;
@@ -461,6 +475,8 @@  __do_kmalloc_node(size_t size, gfp_t gfp, int node, unsigned long caller)
 	}
 
 	kmemleak_alloc(ret, size, 1, gfp);
+	if (unlikely(slob_want_init_on_alloc(gfp, 0)))
+		memset(ret, 0, size);
 	return ret;
 }
 
@@ -559,6 +575,8 @@  static void *slob_alloc_node(struct kmem_cache *c, gfp_t flags, int node)
 		WARN_ON_ONCE(flags & __GFP_ZERO);
 		c->ctor(b);
 	}
+	if (unlikely(slob_want_init_on_alloc(flags, c)))
+		memset(b, 0, c->size);
 
 	kmemleak_alloc_recursive(b, c->size, 1, c->flags, flags);
 	return b;
diff --git a/mm/slub.c b/mm/slub.c
index 6b28cd2b5a58..01424e910800 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1423,6 +1423,19 @@  static __always_inline bool slab_free_hook(struct kmem_cache *s, void *x)
 static inline bool slab_free_freelist_hook(struct kmem_cache *s,
 					   void **head, void **tail)
 {
+
+	void *object;
+	void *next = *head;
+	void *old_tail = *tail ? *tail : *head;
+
+	if (slab_want_init_on_free(s))
+		do {
+			object = next;
+			next = get_freepointer(s, object);
+			memset(object, 0, s->size);
+			set_freepointer(s, object, next);
+		} while (object != old_tail);
+
 /*
  * Compiler cannot detect this function can be removed if slab_free_hook()
  * evaluates to nothing.  Thus, catch all relevant config debug options here.
@@ -1432,9 +1445,7 @@  static inline bool slab_free_freelist_hook(struct kmem_cache *s,
 	defined(CONFIG_DEBUG_OBJECTS_FREE) ||	\
 	defined(CONFIG_KASAN)
 
-	void *object;
-	void *next = *head;
-	void *old_tail = *tail ? *tail : *head;
+	next = *head;
 
 	/* Head and tail of the reconstructed freelist */
 	*head = NULL;
@@ -2740,8 +2751,14 @@  static __always_inline void *slab_alloc_node(struct kmem_cache *s,
 		prefetch_freepointer(s, next_object);
 		stat(s, ALLOC_FASTPATH);
 	}
+	/*
+	 * If the object has been wiped upon free, make sure it's fully
+	 * initialized by zeroing out freelist pointer.
+	 */
+	if (slab_want_init_on_free(s))
+		*(void **)object = 0;
 
-	if (unlikely(gfpflags & __GFP_ZERO) && object)
+	if (unlikely(slab_want_init_on_alloc(gfpflags, s)) && object)
 		memset(object, 0, s->object_size);
 
 	slab_post_alloc_hook(s, gfpflags, 1, &object);
@@ -3163,7 +3180,7 @@  int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 	local_irq_enable();
 
 	/* Clear memory outside IRQ disabled fastpath loop */
-	if (unlikely(flags & __GFP_ZERO)) {
+	if (unlikely(slab_want_init_on_alloc(flags, s))) {
 		int j;
 
 		for (j = 0; j < i; j++)
diff --git a/net/core/sock.c b/net/core/sock.c
index 75b1c950b49f..9ceb90c875bc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1602,7 +1602,7 @@  static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
 		sk = kmem_cache_alloc(slab, priority & ~__GFP_ZERO);
 		if (!sk)
 			return sk;
-		if (priority & __GFP_ZERO)
+		if (want_init_on_alloc(priority))
 			sk_prot_clear_nulls(sk, prot->obj_size);
 	} else
 		sk = kmalloc(prot->obj_size, priority);
diff --git a/security/Kconfig.hardening b/security/Kconfig.hardening
index 0a1d4ca314f4..87883e3e3c2a 100644
--- a/security/Kconfig.hardening
+++ b/security/Kconfig.hardening
@@ -159,6 +159,20 @@  config STACKLEAK_RUNTIME_DISABLE
 	  runtime to control kernel stack erasing for kernels built with
 	  CONFIG_GCC_PLUGIN_STACKLEAK.
 
+config INIT_ON_ALLOC_DEFAULT_ON
+	bool "Set init_on_alloc=1 by default"
+	help
+	  Enable init_on_alloc=1 by default, making the kernel initialize every
+	  page and heap allocation with zeroes.
+	  init_on_alloc can be overridden via command line.
+
+config INIT_ON_FREE_DEFAULT_ON
+	bool "Set init_on_free=1 by default"
+	help
+	  Enable init_on_free=1 by default, making the kernel initialize freed
+	  pages and slab memory with zeroes.
+	  init_on_free can be overridden via command line.
+
 endmenu
 
 endmenu