From patchwork Wed Dec 12 00:03:50 2018
From: Rick Edgecombe <rick.p.edgecombe@intel.com>
To: akpm@linux-foundation.org, luto@kernel.org, will.deacon@arm.com,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 kernel-hardening@lists.openwall.com, naveen.n.rao@linux.vnet.ibm.com,
 anil.s.keshavamurthy@intel.com, davem@davemloft.net, mhiramat@kernel.org,
 rostedt@goodmis.org, mingo@redhat.com, ast@kernel.org, daniel@iogearbox.net,
 jeyu@kernel.org, namit@vmware.com, netdev@vger.kernel.org,
 ard.biesheuvel@linaro.org, jannh@google.com
Cc: kristen@linux.intel.com, dave.hansen@intel.com, deneen.t.dock@intel.com,
 Rick Edgecombe <rick.p.edgecombe@intel.com>
Subject: [PATCH v2 0/4] Don't leave executable TLB entries to freed pages
Date: Tue, 11 Dec 2018 16:03:50 -0800
Message-Id: <20181212000354.31955-1-rick.p.edgecombe@intel.com>

Sometimes when memory is freed via the module subsystem, a TLB entry with
executable permissions can remain pointing to the freed page. If the page is
re-used to back an address that will receive data from userspace, user data
can end up mapped as executable in the kernel. The root of this behavior is
that vfree lazily flushes the TLB, but does not lazily free the underlying
pages.

This v2 enables vfree to handle freeing memory with special permissions, so
the operation can now be done with no W^X window, the logic for it is
centralized, and on x86 it takes only one TLB flush. I'm not sure whether the
algorithm Andy Lutomirski suggested (doing the whole teardown with one TLB
flush) will work on other architectures, so in this version it lives in an
x86 arch breakout (arch_vunmap). The default arch_vunmap implementation does
what Nadav is proposing users of module_alloc do on teardown, so its behavior
should be unchanged, just centralized.
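Since the discussion leans on what the default arch_vunmap does, here is a
minimal sketch of that teardown path as described above. This is illustrative
only: the arch_vunmap name comes from this series, but the exact signature
and the special-permissions argument are my assumptions, not the code in the
patches.

#include <linux/vmalloc.h>
#include <asm/set_memory.h>

/*
 * Sketch of the default (non-x86) arch_vunmap described above. The
 * signature and details are assumptions for illustration; see patch 4
 * for the more efficient x86 version.
 */
static void arch_vunmap(struct vm_struct *area, int special_perms)
{
	unsigned long addr = (unsigned long)area->addr;

	if (special_perms) {
		/*
		 * Reset permissions while the mapping is still live, so
		 * there is no W^X violating window and no executable TLB
		 * entry can outlive the allocation. Each set_memory_*
		 * call can flush the TLB, which is where the extra flush
		 * mentioned below for BPF teardown comes from.
		 */
		set_memory_nx(addr, area->nr_pages);
		set_memory_rw(addr, area->nr_pages);
	}

	/* Unmap, and on this path flush, before the pages are freed */
	remove_vm_area(area->addr);
}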
The main difference is that on architectures that have set_memory_* defined,
BPF teardown will now get an extra TLB flush, from calling set_memory_nx in
addition to set_memory_rw. On x86, due to the more efficient arch version, it
will be unchanged at one flush. The logic enabling this behavior is plugged
into kernel/module.c and the cross-arch BPF pieces, so it should be enabled
for all architectures for regular .ko modules and BPF, while the other
module_alloc users are unchanged for now (a rough caller-side sketch follows
after the diffstat below).

I did find one small downside with this approach: there is occasionally one
extra direct map page split on module teardown, since one of the module
subsections is RW. The x86 arch_vunmap sets the direct map of the pages
not-present, since it doesn't know that the whole allocation is not
executable, so sometimes this splits an extra large page, because that is the
first special permission the paging structure receives. On the plus side,
many TLB flushes are reduced down to one (on x86 here, and likely other
architectures in the future). The other module_alloc users (BPF, etc.) have
no RW subsections, so their splits will not increase. So I think it's not a
big downside for a few modules, compared to reducing TLB flushes, removing
stale executable TLB entries, and simplifying the code.

Todo:
 - Merge with Nadav Amit's patchset
 - Test on x86 32 bit with highmem
 - Plug into the ftrace and kprobes implementations in the next version of
   Nadav's patchset

Changes since v1:
 - New efficient algorithm on x86 for tearing down executable RO memory, and
   a flag for this (Andy Lutomirski)
 - No W^X violating window on teardown (Nadav Amit)

Rick Edgecombe (4):
  vmalloc: New flags for safe vfree on special perms
  modules: Add new special vfree flags
  bpf: switch to new vmalloc vfree flags
  x86/vmalloc: Add TLB efficient x86 arch_vunmap

 arch/x86/include/asm/set_memory.h |  2 +
 arch/x86/mm/Makefile              |  3 +-
 arch/x86/mm/pageattr.c            | 11 +++--
 arch/x86/mm/vmalloc.c             | 71 ++++++++++++++++++++++++++++++
 include/linux/filter.h            | 26 +++++------
 include/linux/vmalloc.h           |  2 +
 kernel/bpf/core.c                 |  1 -
 kernel/module.c                   | 43 +++++-------
 mm/vmalloc.c                      | 73 ++++++++++++++++++++++++++++---
 9 files changed, 173 insertions(+), 59 deletions(-)
 create mode 100644 arch/x86/mm/vmalloc.c
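To make the caller-side usage concrete, here is a hedged sketch of how a
module_alloc user such as BPF might opt in to the new vfree behavior. The
flag-setting helper and its name (set_vm_special below) are placeholders
made up for illustration; the real flags and interface are introduced in
patch 1 ("vmalloc: New flags for safe vfree on special perms").

#include <linux/vmalloc.h>
#include <linux/moduleloader.h>
#include <asm/set_memory.h>

/* Allocate a region that will hold read-only, executable text. */
static void *alloc_exec_region(unsigned long size)
{
	void *addr = module_alloc(size);	/* size assumed page aligned */

	if (!addr)
		return NULL;

	/*
	 * Mark the mapping as having special permissions so vfree will
	 * reset them and flush before the pages are freed.
	 * set_vm_special() is a hypothetical helper, standing in for
	 * the flags added in patch 1.
	 */
	set_vm_special(addr);

	set_memory_ro((unsigned long)addr, size >> PAGE_SHIFT);
	set_memory_x((unsigned long)addr, size >> PAGE_SHIFT);
	return addr;
}

static void free_exec_region(void *addr)
{
	/*
	 * With the new flags, callers no longer call set_memory_nx/rw
	 * themselves: vfree handles the nx/rw reset and the TLB flush
	 * with no W^X window.
	 */
	vfree(addr);
}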