mbox series

[v2,0/4] Don’t leave executable TLB entries to freed pages

Message ID 20181212000354.31955-1-rick.p.edgecombe@intel.com (mailing list archive)
Headers show
Series Don’t leave executable TLB entries to freed pages | expand

Message

Rick Edgecombe Dec. 12, 2018, 12:03 a.m. UTC
Sometimes when memory is freed via the module subsystem, an executable
permissioned TLB entry can remain to a freed page. If the page is re-used to
back an address that will receive data from userspace, it can result in user
data being mapped as executable in the kernel. The root of this behavior is
vfree lazily flushing the TLB, but not lazily freeing the underlying pages.

This v2 enables vfree to handle freeing memory with special permissions. So now
it can be done with no W^X window, centralizing the logic for this operation,
and also to do this with only one TLB flush on x86.

I'm not sure if the algorithm Andy Lutomirski suggested (to do the whole
teardown with one TLB flush) will work across other architectures or not, so it
is in an x86 arch breakout(arch_vunmap) in this version. The default arch_vunmap
implementation does what Nadav is proposing users of module_alloc do on tear
down so it should be unchanged in behavior, just centralized. The main
difference will be BPF teardown will now get an extra TLB flush on archs that
have set_memory_* defined from set_memory_nx in addition to set_memory_rw. On
x86, due to the more efficient arch version, it will be unchanged at one flush.

The logic enabling this behavior is plugged into kernel/module.c and bpf cross
arch pieces. So it should be enabled for all architectures for regular .ko
modules and bpf but the other module_alloc users will be unchanged for now.

I did find one small downside with this approach, and that is that there is
occasionally one extra directmap page split in modules tear down, since one of
the modules subsections is RW. The x86 arch_vunmap will set the RW directmap of
the pages not present, since it doesn't know the whole thing is not executable,
so sometimes this results in an splitting an extra large page because the paging
structure would have its first special permission. But on the plus side many TLB
flushes are reduced down to one (on x86 here, and likely others in the future).
The other usages of modules (bpf, etc) will not have RW subsections and so this
will not increase. So I am thinking its not a big downside for a few modules
compared to reducing TLB flushes, removing executable stale TLB entries and code
simplicity.

Todo:
 - Merge with Nadav Amit's patchset
 - Test on x86 32 bit with highmem
 - Plug into ftrace and kprobes implementations in Nadav's next version of his
   patchset

Changes since v1:
 - New efficient algorithm on x86 for tearing down executable RO memory and
   flag for this (Andy Lutomirski)
 - Have no W^X violating window on tear down (Nadav Amit)


Rick Edgecombe (4):
  vmalloc: New flags for safe vfree on special perms
  modules: Add new special vfree flags
  bpf: switch to new vmalloc vfree flags
  x86/vmalloc: Add TLB efficient x86 arch_vunmap

 arch/x86/include/asm/set_memory.h |  2 +
 arch/x86/mm/Makefile              |  3 +-
 arch/x86/mm/pageattr.c            | 11 +++--
 arch/x86/mm/vmalloc.c             | 71 ++++++++++++++++++++++++++++++
 include/linux/filter.h            | 26 +++++------
 include/linux/vmalloc.h           |  2 +
 kernel/bpf/core.c                 |  1 -
 kernel/module.c                   | 43 +++++-------------
 mm/vmalloc.c                      | 73 ++++++++++++++++++++++++++++---
 9 files changed, 173 insertions(+), 59 deletions(-)
 create mode 100644 arch/x86/mm/vmalloc.c