Message ID | 20181128000754.18056-1-rick.p.edgecombe@intel.com (mailing list archive) |
---|---|
Series | Don’t leave executable TLB entries to freed pages |
> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:
>
> Sometimes when memory is freed via the module subsystem, a TLB entry with
> executable permissions can remain for a freed page. If the page is re-used to
> back an address that will receive data from userspace, it can result in user
> data being mapped as executable in the kernel. The root of this behavior is
> vfree lazily flushing the TLB, but not lazily freeing the underlying pages.
>
> There are roughly three categories of this which show up across modules, bpf,
> kprobes and ftrace:
>
> 1. When executable memory is touched and then immediately freed
>
>    This shows up in a couple of error conditions in the module loader and
>    BPF JIT compiler.

Interesting!

Note that this may cause conflict with "x86: avoid W^X being broken during
modules loading", which I recently submitted.
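For illustration, a minimal sketch of the "category 1" window described above,
assuming an x86 module_alloc() mapping created with PAGE_KERNEL_EXEC;
image_is_valid() is a hypothetical validation helper standing in for the real
checks in the module loader and BPF JIT:

#include <linux/moduleloader.h>
#include <linux/string.h>
#include <linux/errno.h>

/* Hypothetical error path illustrating "category 1" above. */
static int load_image(const void *image, unsigned long size)
{
	void *p = module_alloc(size);	/* PAGE_KERNEL_EXEC mapping on x86 */

	if (!p)
		return -ENOMEM;

	memcpy(p, image, size);		/* touch: populates an executable TLB entry */

	if (!image_is_valid(p, size)) {	/* hypothetical validation helper */
		module_memfree(p);	/* vfree(): the TLB flush is deferred, but
					 * the backing pages are freed right away */
		return -EINVAL;
	}
	return 0;
}

In the error branch the physical pages can be recycled (e.g. for user-supplied
data) while a stale executable translation may still be present in some TLB.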
> On Nov 27, 2018, at 5:06 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>
>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:
>>
>> Sometimes when memory is freed via the module subsystem, a TLB entry with
>> executable permissions can remain for a freed page. If the page is re-used to
>> back an address that will receive data from userspace, it can result in user
>> data being mapped as executable in the kernel. The root of this behavior is
>> vfree lazily flushing the TLB, but not lazily freeing the underlying pages.
>>
>> There are roughly three categories of this which show up across modules, bpf,
>> kprobes and ftrace:
>>
>> 1. When executable memory is touched and then immediately freed
>>
>>    This shows up in a couple of error conditions in the module loader and
>>    BPF JIT compiler.
>
> Interesting!
>
> Note that this may cause conflict with "x86: avoid W^X being broken during
> modules loading", which I recently submitted.

I actually have not looked at the vmalloc() code much recently, but this
seems ... strange:

void vm_unmap_aliases(void)
{
	...
	mutex_lock(&vmap_purge_lock);
	purge_fragmented_blocks_allcpus();
	if (!__purge_vmap_area_lazy(start, end) && flush)
		flush_tlb_kernel_range(start, end);
	mutex_unlock(&vmap_purge_lock);
}

Since __purge_vmap_area_lazy() releases the memory, it seems there is a time
window between the release of the region and the TLB flush, in which the
area can be allocated for another purpose. This can result in a
(theoretical) correctness issue. No?
On Tue, Nov 27, 2018 at 05:21:08PM -0800, Nadav Amit wrote:
> > On Nov 27, 2018, at 5:06 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> >
> >> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:
> >>
> >> Sometimes when memory is freed via the module subsystem, a TLB entry with
> >> executable permissions can remain for a freed page. [...]
> >
> > Interesting!
> >
> > Note that this may cause conflict with "x86: avoid W^X being broken during
> > modules loading", which I recently submitted.
>
> I actually have not looked at the vmalloc() code much recently, but this
> seems ... strange:
>
> void vm_unmap_aliases(void)
> {
> 	...
> 	mutex_lock(&vmap_purge_lock);
> 	purge_fragmented_blocks_allcpus();
> 	if (!__purge_vmap_area_lazy(start, end) && flush)
> 		flush_tlb_kernel_range(start, end);
> 	mutex_unlock(&vmap_purge_lock);
> }
>
> Since __purge_vmap_area_lazy() releases the memory, it seems there is a time
> window between the release of the region and the TLB flush, in which the
> area can be allocated for another purpose. This can result in a
> (theoretical) correctness issue. No?

If __purge_vmap_area_lazy() returns false, then it hasn't freed the memory,
so we only invalidate the TLB if 'flush' is true in that case. If
__purge_vmap_area_lazy() returns true instead, then it takes care of the TLB
invalidation before the freeing.

Will
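A rough sketch of the ordering contract Will describes, paraphrased from the
mm/vmalloc.c of that era rather than copied verbatim; the list handling is
simplified and collect_purge_list() / free_collected_vmap_areas() are
placeholder helper names:

/* Simplified sketch, not the actual kernel code. */
static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
{
	/* Gather the lazily-freed areas; widen [start, end) to cover them. */
	if (!collect_purge_list(&start, &end))	/* placeholder helper */
		return false;			/* nothing to free: the caller
						 * may still flush if it wants */

	/* Invalidate any stale translations first ... */
	flush_tlb_kernel_range(start, end);

	/* ... and only then hand the vmap areas (and pages) back. */
	free_collected_vmap_areas();		/* placeholder helper */
	return true;
}

So a "true" return means the TLB was already flushed for the whole purged
range before the memory was released, which is why vm_unmap_aliases() only
flushes itself when the purge did nothing.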
> On Nov 28, 2018, at 1:57 AM, Will Deacon <will.deacon@arm.com> wrote:
>
> On Tue, Nov 27, 2018 at 05:21:08PM -0800, Nadav Amit wrote:
>> [...]
>> Since __purge_vmap_area_lazy() releases the memory, it seems there is a time
>> window between the release of the region and the TLB flush, in which the
>> area can be allocated for another purpose. This can result in a
>> (theoretical) correctness issue. No?
>
> If __purge_vmap_area_lazy() returns false, then it hasn't freed the memory,
> so we only invalidate the TLB if 'flush' is true in that case. If
> __purge_vmap_area_lazy() returns true instead, then it takes care of the TLB
> invalidation before the freeing.

Right. Sorry for my misunderstanding.

Thanks,
Nadav
On Tue, 27 Nov 2018 16:07:52 -0800
Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:

> Sometimes when memory is freed via the module subsystem, a TLB entry with
> executable permissions can remain for a freed page. If the page is re-used to
> back an address that will receive data from userspace, it can result in user
> data being mapped as executable in the kernel. The root of this behavior is
> vfree lazily flushing the TLB, but not lazily freeing the underlying pages.

Good catch!

> There are roughly three categories of this which show up across modules, bpf,
> kprobes and ftrace:

For x86-64 kprobes, it sets the page NX and after that RW, and then releases it
via module_memfree(). So I'm not sure it really happens on kprobes. (Of course
the default memory allocator is simpler, so it may happen on other archs.) But
interesting fixes.

Thank you,

> 1. When executable memory is touched and then immediately freed
>
>    This shows up in a couple of error conditions in the module loader and
>    BPF JIT compiler.
>
> 2. When executable memory is set to RW right before being freed
>
>    In this case (on x86 and probably others) there will be a TLB flush when
>    it is set to RW, and since the pages are not touched between the flush and
>    the free, it should not be in the TLB in most cases. So this category is
>    not as big of a concern. However, technically there is still a race where
>    an attacker could try to keep it alive for a short window with a
>    well-timed out-of-bounds read or speculative read, so ideally this could
>    be blocked as well.
>
> 3. When executable memory is freed in an interrupt
>
>    At least one example of this is the freeing of init sections in the module
>    loader. Since vmalloc reuses the allocation for the work queue linked list
>    node for the deferred frees, the memory actually gets touched as part of
>    the vfree operation and so returns to the TLB even after the flush from
>    resetting the permissions.
>
> I have only actually tested category 1, and identified 2 and 3 just from
> reading the code.
>
> To catch all of these, module_alloc for x86 is changed to use a new flag that
> instructs the unmap operation to flush the TLB before freeing the pages.
>
> If this solution seems good I can plug the flag in for other architectures
> that define PAGE_KERNEL_EXEC.
>
>
> Rick Edgecombe (2):
>   vmalloc: New flag for flush before releasing pages
>   x86/modules: Make x86 allocs to flush when free
>
>  arch/x86/kernel/module.c |  4 ++--
>  include/linux/vmalloc.h  |  1 +
>  mm/vmalloc.c             | 13 +++++++++++--
>  3 files changed, 14 insertions(+), 4 deletions(-)
>
> --
> 2.17.1
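To make the cover letter's proposal concrete, here is a minimal sketch of what
the x86 module_alloc() side of such a change might look like.
VM_FLUSH_BEFORE_FREE is a placeholder name (the series defines its own flag),
the KASLR load offset the real function adds is omitted, and the vfree() side
is assumed to test the flag and call flush_tlb_kernel_range() on the mapping
before returning the pages to the page allocator:

#include <linux/vmalloc.h>
#include <linux/moduleloader.h>

void *module_alloc(unsigned long size)
{
	if (PAGE_ALIGN(size) > MODULES_LEN)
		return NULL;

	/*
	 * Pass the new vm_flags bit so the unmap path flushes the TLB for
	 * this mapping before freeing the backing pages (flag name is a
	 * placeholder, not necessarily the one used in the series).
	 */
	return __vmalloc_node_range(size, MODULE_ALIGN, MODULES_VADDR,
				    MODULES_END, GFP_KERNEL, PAGE_KERNEL_EXEC,
				    VM_FLUSH_BEFORE_FREE, NUMA_NO_NODE,
				    __builtin_return_address(0));
}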
On Thu, 2018-11-29 at 23:06 +0900, Masami Hiramatsu wrote:
> On Tue, 27 Nov 2018 16:07:52 -0800
> Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:
>
> > Sometimes when memory is freed via the module subsystem, a TLB entry with
> > executable permissions can remain for a freed page. If the page is re-used
> > to back an address that will receive data from userspace, it can result in
> > user data being mapped as executable in the kernel. The root of this
> > behavior is vfree lazily flushing the TLB, but not lazily freeing the
> > underlying pages.
>
> Good catch!
>
> > There are roughly three categories of this which show up across modules,
> > bpf, kprobes and ftrace:
>
> For x86-64 kprobes, it sets the page NX and after that RW, and then releases
> it via module_memfree(). So I'm not sure it really happens on kprobes. (Of
> course the default memory allocator is simpler, so it may happen on other
> archs.) But interesting fixes.

Yes, I think you are right, it should not leave an executable TLB entry in this
case. Ftrace actually does this on x86 as well.

Is there some other reason for calling set_memory_nx that should apply
elsewhere for module users? Or could it be removed in the case of this patch to
centralize the behavior?

Thanks,

Rick

> Thank you,
>
> > 1. When executable memory is touched and then immediately freed
> [...]
On Thu, 29 Nov 2018 18:49:26 +0000
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Thu, 2018-11-29 at 23:06 +0900, Masami Hiramatsu wrote:
> > [...]
> > For x86-64 kprobes, it sets the page NX and after that RW, and then
> > releases it via module_memfree(). So I'm not sure it really happens on
> > kprobes. (Of course the default memory allocator is simpler, so it may
> > happen on other archs.) But interesting fixes.
>
> Yes, I think you are right, it should not leave an executable TLB entry in
> this case. Ftrace actually does this on x86 as well.
>
> Is there some other reason for calling set_memory_nx that should apply
> elsewhere for module users? Or could it be removed in the case of this patch
> to centralize the behavior?

According to commit c93f5cf571e7 ("kprobes/x86: Fix to set RWX bits correctly
before releasing trampoline"), if we release a read-only page via
module_memfree(), it causes a kernel crash. And at the moment, on x86-64 we set
the trampoline page read-only because it is an executable page.

Setting the NX bit is for security reasons, and it should be set before making
the page writable. So I think if you centralize setting the NX bit, it should
be done before setting the writable bit.

Thank you,
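For reference, the x86-64 release sequence Masami describes looks roughly like
the kprobes free_insn_page() path after commit c93f5cf571e7; this is a
from-memory sketch rather than a verbatim copy of the tree:

/*
 * Sketch of the x86-64 kprobes insn-slot release path: drop the X
 * permission first (so there is no window where the page is both
 * writable and executable), then restore W so that module_memfree()
 * is not asked to free a read-only page, which would crash.
 */
void free_insn_page(void *page)
{
	set_memory_nx((unsigned long)page, 1);
	set_memory_rw((unsigned long)page, 1);
	module_memfree(page);
}

Note that on x86 the set_memory_*() calls themselves flush the TLB for the
page, which is why this path does not normally leave a stale executable entry
behind, per the discussion above.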