From patchwork Wed Dec 12 00:03:50 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Edgecombe, Rick P"
X-Patchwork-Id: 10725323
From: Rick Edgecombe
To: akpm@linux-foundation.org, luto@kernel.org, will.deacon@arm.com,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 kernel-hardening@lists.openwall.com, naveen.n.rao@linux.vnet.ibm.com,
 anil.s.keshavamurthy@intel.com, davem@davemloft.net,
 mhiramat@kernel.org, rostedt@goodmis.org, mingo@redhat.com,
 ast@kernel.org, daniel@iogearbox.net, jeyu@kernel.org, namit@vmware.com,
 netdev@vger.kernel.org, ard.biesheuvel@linaro.org, jannh@google.com
Cc: kristen@linux.intel.com, dave.hansen@intel.com,
 deneen.t.dock@intel.com, Rick Edgecombe
Subject: [PATCH v2 0/4] Don’t leave executable TLB entries to freed pages
Date: Tue, 11 Dec 2018 16:03:50 -0800
Message-Id: <20181212000354.31955-1-rick.p.edgecombe@intel.com>
X-Mailer: git-send-email 2.17.1

Sometimes when memory is freed via the module subsystem, a TLB entry
with executable permissions can remain for a freed page. If the page is
re-used to back an address that will receive data from userspace, this
can result in user data being mapped as executable in the kernel. The
root of this behavior is vfree lazily flushing the TLB, but not lazily
freeing the underlying pages.

This v2 enables vfree to handle freeing memory with special permissions.
The teardown can now be done with no W^X-violating window, the logic for
the operation is centralized, and on x86 it takes only one TLB flush.

I'm not sure if the algorithm Andy Lutomirski suggested (to do the whole
teardown with one TLB flush) will work across other architectures or
not, so it is in an x86 arch breakout (arch_vunmap) in this version. The
default arch_vunmap implementation does what Nadav is proposing users of
module_alloc do on teardown, so it should be unchanged in behavior, just
centralized. The main difference is that BPF teardown will now get an
extra TLB flush on archs that have set_memory_* defined, from
set_memory_nx in addition to set_memory_rw. On x86, due to the more
efficient arch version, it will be unchanged at one flush.
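
To make the root cause above concrete, here is a minimal userspace
sketch. It is a toy model and not code from this series: every toy_*
name is hypothetical, the "TLB" is a single cached translation, and
"flush before free" is only an assumed stand-in for whatever ordering
makes the stale entry go away before the page can be reused.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model, not kernel code. All toy_* names are hypothetical. */
enum { TOY_NPAGES = 4 };

struct toy_machine {
    bool page_in_use[TOY_NPAGES];
    /* One cached translation stands in for the whole TLB. */
    bool tlb_valid;
    bool tlb_exec;
    int tlb_page;
};

/* Hand out the lowest free page, so a just-freed page is reused at once. */
int toy_alloc_page(struct toy_machine *m)
{
    for (int i = 0; i < TOY_NPAGES; i++) {
        if (!m->page_in_use[i]) {
            m->page_in_use[i] = true;
            return i;
        }
    }
    return -1;
}

/* Map a page executable and let the "CPU" cache that translation. */
int toy_map_exec(struct toy_machine *m)
{
    int page = toy_alloc_page(m);

    assert(page >= 0);
    m->tlb_valid = true;
    m->tlb_exec = true;
    m->tlb_page = page;
    return page;
}

void toy_flush_tlb(struct toy_machine *m)
{
    m->tlb_valid = false;
}

/* Problem ordering: the page is freed now, the flush is deferred. */
void toy_free_lazy_flush(struct toy_machine *m, int page)
{
    m->page_in_use[page] = false;
}

/* Assumed safe ordering in this model: flush first, then free the page. */
void toy_free_after_flush(struct toy_machine *m, int page)
{
    toy_flush_tlb(m);
    m->page_in_use[page] = false;
}

/* Is there still a cached executable translation that reaches this page? */
bool toy_stale_exec(const struct toy_machine *m, int page)
{
    return m->tlb_valid && m->tlb_exec && m->tlb_page == page;
}

/*
 * Free "module text", reuse the page for "user data", and report whether
 * the reused page is still reachable through an executable translation.
 */
bool toy_reuse_is_executable(bool flush_before_free)
{
    struct toy_machine m = { .tlb_valid = false };
    int text = toy_map_exec(&m);

    if (flush_before_free)
        toy_free_after_flush(&m, text);
    else
        toy_free_lazy_flush(&m, text);

    int data = toy_alloc_page(&m); /* the same page comes back */
    return toy_stale_exec(&m, data);
}
```

In this model, toy_reuse_is_executable(false) reports the stale
executable alias and toy_reuse_is_executable(true) does not. The real
code also has to deal with the directmap alias and with permissions,
which the sketch leaves out.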

The logic enabling this behavior is plugged into kernel/module.c and the
cross-arch BPF pieces. So it should be enabled on all architectures for
regular .ko modules and BPF, but the other module_alloc users will be
unchanged for now.

I did find one small downside with this approach: there is occasionally
one extra directmap page split in module teardown, since one of the
module subsections is RW. The x86 arch_vunmap will set the RW directmap
mapping of the pages to not present, since it doesn't know the whole
thing is not executable, so sometimes this results in splitting an extra
large page because the paging structure would have its first special
permission. But on the plus side, many TLB flushes are reduced down to
one (on x86 here, and likely others in the future). The other usages of
modules (BPF, etc.) will not have RW subsections, so this will not
increase for them. So I am thinking it's not a big downside for a few
modules, compared to reducing TLB flushes, removing stale executable TLB
entries and simpler code.
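
The extra split described above can be pictured with another small toy
model. Again this is a sketch under stated assumptions, not code from
the series: the toy_* names are hypothetical, a large page is taken to
cover 512 4K pages (2M on x86), and the model assumes the large page
covering the text page was already split at load time.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model, not kernel code. All toy_* names are hypothetical. */
enum { TOY_PAGES_PER_LARGE = 512, TOY_NLARGE = 4 };

struct toy_directmap {
    bool split[TOY_NLARGE]; /* large mapping already broken into 4K entries? */
    int splits;             /* number of splits performed so far */
};

/* Give one 4K page a permission that differs from its large mapping. */
void toy_set_special(struct toy_directmap *dm, int pfn)
{
    int large = pfn / TOY_PAGES_PER_LARGE;

    assert(large >= 0 && large < TOY_NLARGE);
    if (!dm->split[large]) {
        dm->split[large] = true;
        dm->splits++;
    }
}

/*
 * Count the splits caused by teardown for a module with one text page and
 * one RW page. touch_rw models teardown also marking the RW page's
 * directmap entry not present.
 */
int toy_teardown_splits(int text_pfn, int rw_pfn, bool touch_rw)
{
    struct toy_directmap dm = { { false }, 0 };

    /* Assumption: the text page got special permissions at load time. */
    toy_set_special(&dm, text_pfn);
    int before = dm.splits;

    if (touch_rw)
        toy_set_special(&dm, rw_pfn);
    return dm.splits - before;
}
```

With the RW page in the same large page as the text (pfns 0 and 1),
teardown adds no split. With the RW page in an otherwise untouched large
page (pfns 0 and 512), touching it costs one split, which matches the
"occasionally one extra" case above.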

Todo:
 - Merge with Nadav Amit's patchset
 - Test on x86 32 bit with highmem
 - Plug into ftrace and kprobes implementations in Nadav's next version
   of his patchset

Changes since v1:
 - New efficient algorithm on x86 for tearing down executable RO memory
   and flag for this (Andy Lutomirski)
 - Have no W^X violating window on tear down (Nadav Amit)

Rick Edgecombe (4):
  vmalloc: New flags for safe vfree on special perms
  modules: Add new special vfree flags
  bpf: switch to new vmalloc vfree flags
  x86/vmalloc: Add TLB efficient x86 arch_vunmap

 arch/x86/include/asm/set_memory.h |  2 +
 arch/x86/mm/Makefile              |  3 +-
 arch/x86/mm/pageattr.c            | 11 +++--
 arch/x86/mm/vmalloc.c             | 71 ++++++++++++++++++++++++++++++
 include/linux/filter.h            | 26 +++++------
 include/linux/vmalloc.h           |  2 +
 kernel/bpf/core.c                 |  1 -
 kernel/module.c                   | 43 +++++-------------
 mm/vmalloc.c                      | 73 ++++++++++++++++++++++++++++---
 9 files changed, 173 insertions(+), 59 deletions(-)
 create mode 100644 arch/x86/mm/vmalloc.c