From patchwork Fri Nov 20 20:24:17 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Edgecombe, Rick P" X-Patchwork-Id: 11922297 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 765FFC64E90 for ; Fri, 20 Nov 2020 20:31:23 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 347892223F for ; Fri, 20 Nov 2020 20:31:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731141AbgKTUa1 (ORCPT ); Fri, 20 Nov 2020 15:30:27 -0500 Received: from mga07.intel.com ([134.134.136.100]:10342 "EHLO mga07.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730982AbgKTUaC (ORCPT ); Fri, 20 Nov 2020 15:30:02 -0500 IronPort-SDR: qNoiMpM0py60yXGsxFgMF2HEvEqkBlz2h5XHGkx2FGsdwOi6LEypv3c5d4eQ299JOZO/A0B9F5 WKvtOEaWRghg== X-IronPort-AV: E=McAfee;i="6000,8403,9811"; a="235683271" X-IronPort-AV: E=Sophos;i="5.78,357,1599548400"; d="scan'208";a="235683271" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga002.jf.intel.com ([10.7.209.21]) by orsmga105.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 20 Nov 2020 12:30:01 -0800 IronPort-SDR: dAe5oTws0poBfSLk4kjl9AcE0urnr2caDZLch0+CDP8DOUTygKloKA0cixjo5rjNeCqyOi9QHs 0bCDmiUZNaQg== X-IronPort-AV: E=Sophos;i="5.78,357,1599548400"; d="scan'208";a="342163272" Received: from rpedgeco-mobl.amr.corp.intel.com (HELO localhost.intel.com) ([10.209.105.214]) by 
orsmga002-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 20 Nov 2020 12:30:00 -0800 From: Rick Edgecombe To: akpm@linux-foundation.org, jeyu@kernel.org, bpf@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net, luto@kernel.org, dave.hansen@linux.intel.com, peterz@infradead.org, x86@kernel.org, rppt@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, dan.j.williams@intel.com Cc: elena.reshetova@intel.com, ira.weiny@intel.com, Rick Edgecombe Subject: [PATCH RFC 01/10] vmalloc: Add basic perm alloc implementation Date: Fri, 20 Nov 2020 12:24:17 -0800 Message-Id: <20201120202426.18009-2-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20201120202426.18009-1-rick.p.edgecombe@intel.com> References: <20201120202426.18009-1-rick.p.edgecombe@intel.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-State: RFC In order to allow for future arch-specific optimizations for vmalloc permissions, first add an implementation of a new interface that will work cross-arch by using the existing set_memory_() functions. When allocating memory that will be RO, for example, it should be used like: /* Reserve va */ struct perm_allocation *alloc = perm_alloc(vstart, vend, page_cnt, PERM_R); unsigned long ro = (unsigned long)perm_alloc_address(alloc); /* Write to writable address */ strcpy((char *)perm_writable_addr(alloc, ro), "Some data to be RO"); /* Signal that writing is done and mapping should be live */ perm_writable_finish(alloc); /* Print from RO address */ printk("Read only data is: %s\n", (char *)ro); Create some new flags to handle the memory permissions currently defined cross-architecturally in the set_memory_() function names themselves. The PAGE_ defines are not uniform across the architectures, so they couldn't be used without unifying them.
However in the future there may also be some other flags, for example requesting to try to allocate into part of a 2MB page for longer lived allocations. Have the default implementation use the primary address for loading the data as is done today for special kernel permission usages. However, make the interface compatible with having the writable data loaded at a separate address or via some PKS backed solution. Allocate using module_alloc() in the default implementation in order to allocate from each arch's chosen place for executable code. Signed-off-by: Rick Edgecombe --- arch/Kconfig | 3 + include/linux/vmalloc.h | 82 ++++++++++++++++++++++++ mm/nommu.c | 66 ++++++++++++++++++++ mm/vmalloc.c | 135 ++++++++++++++++++++++++++++++++++++++++ 4 files changed, 286 insertions(+) diff --git a/arch/Kconfig b/arch/Kconfig index 56b6ccc0e32d..0fa42f76548d 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -259,6 +259,9 @@ config ARCH_HAS_SET_MEMORY config ARCH_HAS_SET_DIRECT_MAP bool +config ARCH_HAS_PERM_ALLOC_IMPLEMENTATION + bool + # # Select if the architecture provides the arch_dma_set_uncached symbol to # either provide an uncached segement alias for a DMA allocation, or diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 938eaf9517e2..4a6b30014fff 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -248,4 +248,86 @@ pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms) int register_vmap_purge_notifier(struct notifier_block *nb); int unregister_vmap_purge_notifier(struct notifier_block *nb); +#define PERM_R 1 +#define PERM_W 2 +#define PERM_X 4 +#define PERM_RWX (PERM_R | PERM_W | PERM_X) +#define PERM_RW (PERM_R | PERM_W) +#define PERM_RX (PERM_R | PERM_X) + +typedef u8 virtual_perm; + +struct perm_allocation { + struct page **pages; + virtual_perm cur_perm; + virtual_perm orig_perm; + struct vm_struct *area; + unsigned long offset; + unsigned long size; + void *writable; +}; + +/* + * Allocate a special permission kva region. 
The region may not be mapped + * until a call to perm_writable_finish(). A writable region will be mapped + * immediately at the address returned by perm_writable_addr(). The allocation + * will be made between the start and end virtual addresses. + */ +struct perm_allocation *perm_alloc(unsigned long vstart, unsigned long vend, unsigned long page_cnt, + virtual_perm perms); + +/* The writable address for data to be loaded into the allocation */ +unsigned long perm_writable_addr(struct perm_allocation *alloc, unsigned long addr); + +/* Signal that writing is done and the mapping should be made live */ +bool perm_writable_finish(struct perm_allocation *alloc); + +/* Change the permission of an allocation that is already live */ +bool perm_change(struct perm_allocation *alloc, virtual_perm perms); + +/* Free an allocation */ +void perm_free(struct perm_allocation *alloc); + +/* Helper for memsetting an allocation. Should be called before perm_writable_finish() */ +void perm_memset(struct perm_allocation *alloc, char val); + +/* The final address of the allocation */ +static inline unsigned long perm_alloc_address(const struct perm_allocation *alloc) +{ + return (unsigned long)alloc->area->addr + alloc->offset; +} + +/* The size of the allocation */ +static inline unsigned long perm_alloc_size(const struct perm_allocation *alloc) +{ + return alloc->size; +} + +static inline unsigned long within_perm_alloc(const struct perm_allocation *alloc, + unsigned long addr) +{ + unsigned long base, size; + + if (!alloc) + return false; + + base = perm_alloc_address(alloc); + size = perm_alloc_size(alloc); + + return base <= addr && addr < base + size; +} + +static inline unsigned long perm_writable_base(struct perm_allocation *alloc) +{ + return perm_writable_addr(alloc, perm_alloc_address(alloc)); +} + +static inline bool perm_is_writable(struct perm_allocation *alloc) +{ + if (!alloc) + return false; + + return (alloc->cur_perm & PERM_W) || alloc->writable; +} + #endif /* 
_LINUX_VMALLOC_H */ diff --git a/mm/nommu.c b/mm/nommu.c index 0faf39b32cdb..6458bd23de3e 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1810,6 +1810,72 @@ int nommu_shrink_inode_mappings(struct inode *inode, size_t size, return 0; } +struct perm_allocation *perm_alloc(unsigned long vstart, unsigned long vend, unsigned long page_cnt, + virtual_perm perms) +{ + struct perm_allocation *alloc; + struct vm_struct *area; + unsigned long size = page_cnt << PAGE_SHIFT; + void *ptr; + + if (!size) + return NULL; + + alloc = kmalloc(sizeof(*alloc), GFP_KERNEL | __GFP_ZERO); + + if (!alloc) + return NULL; + + area = kmalloc(sizeof(*area), GFP_KERNEL | __GFP_ZERO); + + if (!area) + goto free_alloc; + + alloc->area = area; + + ptr = vmalloc(size); + + if (!ptr) + goto free_area; + + alloc->size = size; + alloc->cur_perm = PERM_RWX; + + return alloc; + +free_area: + kfree(area); +free_alloc: + kfree(alloc); + return NULL; +} + +unsigned long perm_writable_addr(struct perm_allocation *alloc, unsigned long addr) +{ + return addr; +} + +bool perm_writable_finish(struct perm_allocation *alloc) +{ + return true; +} + +bool perm_change(struct perm_allocation *alloc, virtual_perm perms) +{ + return true; +} + +void perm_free(struct perm_allocation *alloc) +{ + if (!alloc) + return; + + kfree(alloc->area); + kfree(alloc); +} + +void perm_memset(struct perm_allocation *alloc, char val) {} + /* * Initialise sysctl_user_reserve_kbytes. * diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 6ae491a8b210..3e8e54a75dfc 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -34,6 +34,7 @@ #include #include #include +#include #include #include @@ -3088,6 +3089,140 @@ void free_vm_area(struct vm_struct *area) } EXPORT_SYMBOL_GPL(free_vm_area); +#ifndef CONFIG_ARCH_HAS_PERM_ALLOC_IMPLEMENTATION + +#ifndef CONFIG_MODULES +/* If modules is not configured, provide stubs so perm_alloc() could use fallback logic. 
*/ +void *module_alloc(unsigned long size) +{ + return NULL; +} + +void module_memfree(void *module_region) { } +#endif /* !CONFIG_MODULES */ + +struct perm_allocation *perm_alloc(unsigned long vstart, unsigned long vend, unsigned long page_cnt, + virtual_perm perms) +{ + struct perm_allocation *alloc; + unsigned long size = page_cnt << PAGE_SHIFT; + void *ptr; + + if (!size) + return NULL; + + alloc = kmalloc(sizeof(*alloc), GFP_KERNEL | __GFP_ZERO); + + if (!alloc) + return NULL; + + ptr = module_alloc(size); + + if (!ptr) { + kfree(alloc); + return NULL; + } + + /* + * In order to work with all arch's we call the arch's module_alloc() which is the only + * cross-arch place where information about where an executable allocation should go is + * located. If the caller passed in a different range they want for the allocation...we + * could try a vmalloc_node_range() at this point, but just return NULL for now. + */ + if ((unsigned long)ptr < vstart || (unsigned long)ptr >= vend) { + module_memfree(ptr); + kfree(alloc); + return NULL; + } + + alloc->area = find_vm_area(ptr); + alloc->size = size; + + if (IS_ENABLED(CONFIG_ARM) || IS_ENABLED(CONFIG_X86)) + alloc->cur_perm = PERM_RW; + else + alloc->cur_perm = PERM_RWX; + + alloc->orig_perm = perms; + + return alloc; +} + +unsigned long perm_writable_addr(struct perm_allocation *alloc, unsigned long addr) +{ + return addr; +} + +bool perm_writable_finish(struct perm_allocation *alloc) +{ + if (!alloc) + return false; + + return perm_change(alloc, alloc->orig_perm); +} + +bool perm_change(struct perm_allocation *alloc, virtual_perm perm) +{ + unsigned long start, npages; + virtual_perm unset, set; + + if (!alloc) + return false; + + npages = alloc->size >> PAGE_SHIFT; + + start = perm_alloc_address(alloc); + + set = ~alloc->cur_perm & perm; + unset = alloc->cur_perm & ~perm; + + if (set & PERM_W) + set_memory_rw(start, npages); + + if (unset & PERM_W) + set_memory_ro(start, npages); + + if (set & PERM_X) + 
set_memory_x(start, npages); + + if (unset & PERM_X) + set_memory_nx(start, npages); + + alloc->cur_perm = perm; + + return false; +} + +static inline bool perms_need_reset(struct perm_allocation *alloc) +{ + return (alloc->cur_perm & PERM_X) || (~alloc->cur_perm & PERM_W); +} + +void perm_free(struct perm_allocation *alloc) +{ + unsigned long addr; + + if (!alloc) + return; + + addr = perm_alloc_address(alloc); + + if (perms_need_reset(alloc)) + set_vm_flush_reset_perms((void *)addr); + + module_memfree((void *)addr); + + kfree(alloc); +} + +void perm_memset(struct perm_allocation *alloc, char val) +{ + if (!alloc) + return; + memset((void *)perm_writable_base(alloc), val, perm_alloc_size(alloc)); +} +#endif /* CONFIG_ARCH_HAS_PERM_ALLOC_IMPLEMENTATION */ + #ifdef CONFIG_SMP static struct vmap_area *node_to_va(struct rb_node *n) { From patchwork Fri Nov 20 20:24:18 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Edgecombe, Rick P" X-Patchwork-Id: 11922167 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B7C85C64EBC for ; Fri, 20 Nov 2020 20:30:06 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 7F3EC22D0A for ; Fri, 20 Nov 2020 20:30:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731049AbgKTUaF (ORCPT ); Fri, 20 Nov 2020 15:30:05 -0500 Received: from mga07.intel.com ([134.134.136.100]:10348 "EHLO 
From: Rick Edgecombe To: akpm@linux-foundation.org, jeyu@kernel.org, bpf@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net, luto@kernel.org, dave.hansen@linux.intel.com, peterz@infradead.org, x86@kernel.org, rppt@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, dan.j.williams@intel.com Cc: elena.reshetova@intel.com, ira.weiny@intel.com, Rick Edgecombe Subject: [PATCH RFC 02/10] bpf: Use perm_alloc() for BPF JIT filters Date: Fri, 20 Nov 2020 12:24:18 -0800 Message-Id: <20201120202426.18009-3-rick.p.edgecombe@intel.com> In-Reply-To: <20201120202426.18009-1-rick.p.edgecombe@intel.com> References: <20201120202426.18009-1-rick.p.edgecombe@intel.com> X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC eBPF has other executable allocations besides filters, but just convert over the filters for now.
Since struct perm_allocation has size information, no longer track this separately. For x86, write the JIT to the address provided by perm_writable_addr() so that in later patches this can be directed to a separate writable staging area. Signed-off-by: Rick Edgecombe --- arch/arm/net/bpf_jit_32.c | 3 +- arch/arm64/net/bpf_jit_comp.c | 5 ++-- arch/mips/net/bpf_jit.c | 2 +- arch/mips/net/ebpf_jit.c | 3 +- arch/powerpc/net/bpf_jit_comp.c | 2 +- arch/powerpc/net/bpf_jit_comp64.c | 10 +++---- arch/s390/net/bpf_jit_comp.c | 5 ++-- arch/sparc/net/bpf_jit_comp_32.c | 2 +- arch/sparc/net/bpf_jit_comp_64.c | 5 ++-- arch/x86/net/bpf_jit_comp.c | 15 ++++++++-- arch/x86/net/bpf_jit_comp32.c | 3 +- include/linux/filter.h | 30 ++++++------------- kernel/bpf/core.c | 48 +++++++++++++++---------------- 13 files changed, 65 insertions(+), 68 deletions(-) diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c index 0207b6ea6e8a..83b7d7a7a833 100644 --- a/arch/arm/net/bpf_jit_32.c +++ b/arch/arm/net/bpf_jit_32.c @@ -1874,7 +1874,7 @@ bool bpf_jit_needs_zext(void) struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) { struct bpf_prog *tmp, *orig_prog = prog; - struct bpf_binary_header *header; + struct perm_allocation *header; bool tmp_blinded = false; struct jit_ctx ctx; unsigned int tmp_idx; @@ -1971,6 +1971,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) prog = orig_prog; goto out_imms; } + prog->alloc = header; /* 2.) 
Actual pass to generate final JIT code */ ctx.target = (u32 *) image_ptr; diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c index ef9f1d5e989d..371a1b8a7fa3 100644 --- a/arch/arm64/net/bpf_jit_comp.c +++ b/arch/arm64/net/bpf_jit_comp.c @@ -970,7 +970,7 @@ static inline void bpf_flush_icache(void *start, void *end) } struct arm64_jit_data { - struct bpf_binary_header *header; + struct perm_allocation *header; u8 *image; struct jit_ctx ctx; }; @@ -979,7 +979,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) { int image_size, prog_size, extable_size; struct bpf_prog *tmp, *orig_prog = prog; - struct bpf_binary_header *header; + struct perm_allocation *header; struct arm64_jit_data *jit_data; bool was_classic = bpf_prog_was_classic(prog); bool tmp_blinded = false; @@ -1055,6 +1055,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) prog = orig_prog; goto out_off; } + prog->alloc = header; /* 2. Now, the actual pass. */ diff --git a/arch/mips/net/bpf_jit.c b/arch/mips/net/bpf_jit.c index 0af88622c619..c8ea5c00ce55 100644 --- a/arch/mips/net/bpf_jit.c +++ b/arch/mips/net/bpf_jit.c @@ -1264,7 +1264,7 @@ void bpf_jit_compile(struct bpf_prog *fp) void bpf_jit_free(struct bpf_prog *fp) { if (fp->jited) - module_memfree(fp->bpf_func); + bpf_jit_binary_free(fp->alloc); bpf_prog_unlock_free(fp); } diff --git a/arch/mips/net/ebpf_jit.c b/arch/mips/net/ebpf_jit.c index 561154cbcc40..e251d0cd33d8 100644 --- a/arch/mips/net/ebpf_jit.c +++ b/arch/mips/net/ebpf_jit.c @@ -1799,7 +1799,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) struct bpf_prog *orig_prog = prog; bool tmp_blinded = false; struct bpf_prog *tmp; - struct bpf_binary_header *header = NULL; + struct perm_allocation *header = NULL; struct jit_ctx ctx; unsigned int image_size; u8 *image_ptr; @@ -1889,6 +1889,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) sizeof(u32), jit_fill_hole); if (header == NULL) goto out_err; + prog->alloc = 
header; ctx.target = (u32 *)image_ptr; diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c index e809cb5a1631..7891a8920c68 100644 --- a/arch/powerpc/net/bpf_jit_comp.c +++ b/arch/powerpc/net/bpf_jit_comp.c @@ -677,7 +677,7 @@ void bpf_jit_compile(struct bpf_prog *fp) void bpf_jit_free(struct bpf_prog *fp) { if (fp->jited) - module_memfree(fp->bpf_func); + bpf_jit_binary_free(fp->alloc); bpf_prog_unlock_free(fp); } diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c index 022103c6a201..d78e582580a5 100644 --- a/arch/powerpc/net/bpf_jit_comp64.c +++ b/arch/powerpc/net/bpf_jit_comp64.c @@ -1062,7 +1062,7 @@ static int bpf_jit_fixup_subprog_calls(struct bpf_prog *fp, u32 *image, } struct powerpc64_jit_data { - struct bpf_binary_header *header; + struct perm_allocation *header; u32 *addrs; u8 *image; u32 proglen; @@ -1085,7 +1085,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp) struct codegen_context cgctx; int pass; int flen; - struct bpf_binary_header *bpf_hdr; + struct perm_allocation *bpf_hdr; struct bpf_prog *org_fp = fp; struct bpf_prog *tmp_fp; bool bpf_blinded = false; @@ -1173,6 +1173,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp) fp = org_fp; goto out_addrs; } + fp->alloc = header; skip_init_ctx: code_base = (u32 *)(image + FUNCTION_DESCR_SIZE); @@ -1249,11 +1250,8 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp) /* Overriding bpf_jit_free() as we don't set images read-only. 
*/ void bpf_jit_free(struct bpf_prog *fp) { - unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK; - struct bpf_binary_header *bpf_hdr = (void *)addr; - if (fp->jited) - bpf_jit_binary_free(bpf_hdr); + bpf_jit_binary_free(fp->alloc); bpf_prog_unlock_free(fp); } diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c index 0a4182792876..e0440307f539 100644 --- a/arch/s390/net/bpf_jit_comp.c +++ b/arch/s390/net/bpf_jit_comp.c @@ -1686,7 +1686,7 @@ bool bpf_jit_needs_zext(void) } struct s390_jit_data { - struct bpf_binary_header *header; + struct perm_allocation *header; struct bpf_jit ctx; int pass; }; @@ -1721,7 +1721,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp) { u32 stack_depth = round_up(fp->aux->stack_depth, 8); struct bpf_prog *tmp, *orig_fp = fp; - struct bpf_binary_header *header; + struct perm_allocation *header; struct s390_jit_data *jit_data; bool tmp_blinded = false; bool extra_pass = false; @@ -1785,6 +1785,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp) fp = orig_fp; goto free_addrs; } + fp->alloc = header; skip_init_ctx: if (bpf_jit_prog(&jit, fp, extra_pass, stack_depth)) { bpf_jit_binary_free(header); diff --git a/arch/sparc/net/bpf_jit_comp_32.c b/arch/sparc/net/bpf_jit_comp_32.c index b1dbf2fa8c0a..2a29d2523fd6 100644 --- a/arch/sparc/net/bpf_jit_comp_32.c +++ b/arch/sparc/net/bpf_jit_comp_32.c @@ -758,7 +758,7 @@ cond_branch: f_offset = addrs[i + filter[i].jf]; void bpf_jit_free(struct bpf_prog *fp) { if (fp->jited) - module_memfree(fp->bpf_func); + bpf_jit_binary_free(fp->alloc); bpf_prog_unlock_free(fp); } diff --git a/arch/sparc/net/bpf_jit_comp_64.c b/arch/sparc/net/bpf_jit_comp_64.c index 3364e2a00989..3ad639166076 100644 --- a/arch/sparc/net/bpf_jit_comp_64.c +++ b/arch/sparc/net/bpf_jit_comp_64.c @@ -1458,7 +1458,7 @@ bool bpf_jit_needs_zext(void) } struct sparc64_jit_data { - struct bpf_binary_header *header; + struct perm_allocation *header; u8 *image; struct jit_ctx ctx; }; @@ 
-1467,7 +1467,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) { struct bpf_prog *tmp, *orig_prog = prog; struct sparc64_jit_data *jit_data; - struct bpf_binary_header *header; + struct perm_allocation *header; u32 prev_image_size, image_size; bool tmp_blinded = false; bool extra_pass = false; @@ -1559,6 +1559,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) prog = orig_prog; goto out_off; } + prog->alloc = header; ctx.image = (u32 *)image_ptr; skip_init_ctx: diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 796506dcfc42..e11ee6b71d40 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -1476,11 +1476,14 @@ xadd: if (is_imm8(insn->off)) } if (image) { + unsigned long writable = perm_writable_addr(bpf_prog->alloc, + (unsigned long)image); + if (unlikely(proglen + ilen > oldproglen)) { pr_err("bpf_jit: fatal error\n"); return -EFAULT; } - memcpy(image + proglen, temp, ilen); + memcpy((void *)writable + proglen, temp, ilen); } proglen += ilen; addrs[i] = proglen; @@ -1965,16 +1968,21 @@ int arch_prepare_bpf_dispatcher(void *image, s64 *funcs, int num_funcs) } struct x64_jit_data { - struct bpf_binary_header *header; + struct perm_allocation *header; int *addrs; u8 *image; int proglen; struct jit_context ctx; }; +struct perm_allocation *bpf_jit_alloc_exec(unsigned long size) +{ + return perm_alloc(MODULES_VADDR, MODULES_END, size >> PAGE_SHIFT, PERM_RX); +} + struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) { - struct bpf_binary_header *header = NULL; + struct perm_allocation *header = NULL; struct bpf_prog *tmp, *orig_prog = prog; struct x64_jit_data *jit_data; int proglen, oldproglen = 0; @@ -2078,6 +2086,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) prog = orig_prog; goto out_addrs; } + prog->alloc = header; prog->aux->extable = (void *) image + roundup(proglen, align); } oldproglen = proglen; diff --git a/arch/x86/net/bpf_jit_comp32.c 
b/arch/x86/net/bpf_jit_comp32.c index 96fde03aa987..680ca859b829 100644 --- a/arch/x86/net/bpf_jit_comp32.c +++ b/arch/x86/net/bpf_jit_comp32.c @@ -2298,7 +2298,7 @@ bool bpf_jit_needs_zext(void) struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) { - struct bpf_binary_header *header = NULL; + struct perm_allocation *header = NULL; struct bpf_prog *tmp, *orig_prog = prog; int proglen, oldproglen = 0; struct jit_context ctx = {}; @@ -2370,6 +2370,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) prog = orig_prog; goto out_addrs; } + prog->alloc = header; } oldproglen = proglen; cond_resched(); diff --git a/include/linux/filter.h b/include/linux/filter.h index 1b62397bd124..1ffd93e01550 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -517,11 +517,6 @@ struct sock_fprog_kern { /* Some arches need doubleword alignment for their instructions and/or data */ #define BPF_IMAGE_ALIGNMENT 8 -struct bpf_binary_header { - u32 pages; - u8 image[] __aligned(BPF_IMAGE_ALIGNMENT); -}; - struct bpf_prog { u16 pages; /* Number of allocated pages */ u16 jited:1, /* Is our filter JIT'ed? 
*/ @@ -544,6 +539,8 @@ struct bpf_prog { struct sock_fprog_kern *orig_prog; /* Original BPF program */ unsigned int (*bpf_func)(const void *ctx, const struct bpf_insn *insn); + struct perm_allocation *alloc; + /* Instructions for interpreter */ struct sock_filter insns[0]; struct bpf_insn insnsi[]; @@ -818,20 +815,9 @@ static inline void bpf_prog_lock_ro(struct bpf_prog *fp) #endif } -static inline void bpf_jit_binary_lock_ro(struct bpf_binary_header *hdr) +static inline void bpf_jit_binary_lock_ro(struct perm_allocation *alloc) { - set_vm_flush_reset_perms(hdr); - set_memory_ro((unsigned long)hdr, hdr->pages); - set_memory_x((unsigned long)hdr, hdr->pages); -} - -static inline struct bpf_binary_header * -bpf_jit_binary_hdr(const struct bpf_prog *fp) -{ - unsigned long real_start = (unsigned long)fp->bpf_func; - unsigned long addr = real_start & PAGE_MASK; - - return (void *)addr; + perm_writable_finish(alloc); } int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap); @@ -986,14 +972,14 @@ extern long bpf_jit_limit; typedef void (*bpf_jit_fill_hole_t)(void *area, unsigned int size); -struct bpf_binary_header * +struct perm_allocation * bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr, unsigned int alignment, bpf_jit_fill_hole_t bpf_fill_ill_insns); -void bpf_jit_binary_free(struct bpf_binary_header *hdr); +void bpf_jit_binary_free(struct perm_allocation *hdr); u64 bpf_jit_alloc_exec_limit(void); -void *bpf_jit_alloc_exec(unsigned long size); -void bpf_jit_free_exec(void *addr); +struct perm_allocation *bpf_jit_alloc_exec(unsigned long size); +void bpf_jit_free_exec(struct perm_allocation *alloc); void bpf_jit_free(struct bpf_prog *fp); int bpf_jit_add_poke_descriptor(struct bpf_prog *prog, diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 55454d2278b1..c0088014c65c 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -530,13 +530,13 @@ long bpf_jit_limit __read_mostly; static void bpf_prog_ksym_set_addr(struct bpf_prog 
*prog) { - const struct bpf_binary_header *hdr = bpf_jit_binary_hdr(prog); - unsigned long addr = (unsigned long)hdr; + const struct perm_allocation *alloc = prog->alloc; + unsigned long addr = perm_alloc_address(alloc); WARN_ON_ONCE(!bpf_prog_ebpf_jited(prog)); - prog->aux->ksym.start = (unsigned long) prog->bpf_func; - prog->aux->ksym.end = addr + hdr->pages * PAGE_SIZE; + prog->aux->ksym.start = (unsigned long)addr; + prog->aux->ksym.end = addr + alloc->size; } static void @@ -843,22 +843,23 @@ static void bpf_jit_uncharge_modmem(u32 pages) atomic_long_sub(pages, &bpf_jit_current); } -void *__weak bpf_jit_alloc_exec(unsigned long size) +struct perm_allocation * __weak bpf_jit_alloc_exec(unsigned long size) { - return module_alloc(size); + /* Note: Range ignored for default perm_alloc implementation */ + return perm_alloc(0, 0, size >> PAGE_SHIFT, PERM_RX); } -void __weak bpf_jit_free_exec(void *addr) +void __weak bpf_jit_free_exec(struct perm_allocation *alloc) { - module_memfree(addr); + perm_free(alloc); } -struct bpf_binary_header * +struct perm_allocation * bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr, unsigned int alignment, bpf_jit_fill_hole_t bpf_fill_ill_insns) { - struct bpf_binary_header *hdr; + struct perm_allocation *alloc; u32 size, hole, start, pages; WARN_ON_ONCE(!is_power_of_2(alignment) || @@ -868,36 +869,35 @@ bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr, * fill a page, allow at least 128 extra bytes to insert a * random section of illegal instructions. */ - size = round_up(proglen + sizeof(*hdr) + 128, PAGE_SIZE); + size = round_up(proglen + 128, PAGE_SIZE); pages = size / PAGE_SIZE; if (bpf_jit_charge_modmem(pages)) return NULL; - hdr = bpf_jit_alloc_exec(size); - if (!hdr) { + alloc = bpf_jit_alloc_exec(size); + if (!alloc) { bpf_jit_uncharge_modmem(pages); return NULL; } /* Fill space with illegal/arch-dep instructions. 
*/ - bpf_fill_ill_insns(hdr, size); + bpf_fill_ill_insns((void *)perm_writable_base(alloc), size); - hdr->pages = pages; - hole = min_t(unsigned int, size - (proglen + sizeof(*hdr)), - PAGE_SIZE - sizeof(*hdr)); + hole = min_t(unsigned int, size - proglen, + PAGE_SIZE); start = (get_random_int() % hole) & ~(alignment - 1); /* Leave a random number of instructions before BPF code. */ - *image_ptr = &hdr->image[start]; + *image_ptr = (void *)perm_alloc_address(alloc) + start; - return hdr; + return alloc; } -void bpf_jit_binary_free(struct bpf_binary_header *hdr) +void bpf_jit_binary_free(struct perm_allocation *alloc) { - u32 pages = hdr->pages; + u32 pages = alloc->size >> PAGE_SHIFT; - bpf_jit_free_exec(hdr); + bpf_jit_free_exec(alloc); bpf_jit_uncharge_modmem(pages); } @@ -908,9 +908,7 @@ void bpf_jit_binary_free(struct bpf_binary_header *hdr) void __weak bpf_jit_free(struct bpf_prog *fp) { if (fp->jited) { - struct bpf_binary_header *hdr = bpf_jit_binary_hdr(fp); - - bpf_jit_binary_free(hdr); + bpf_jit_binary_free(fp->alloc); WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(fp)); } From patchwork Fri Nov 20 20:24:19 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Edgecombe, Rick P" X-Patchwork-Id: 11922169 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8540BC64E90 for ; Fri, 20 Nov 2020 20:30:06 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 51FB2246F8 for ; Fri, 20 Nov 2020 20:30:06 +0000 
From: Rick Edgecombe To: akpm@linux-foundation.org, jeyu@kernel.org, bpf@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net, luto@kernel.org, dave.hansen@linux.intel.com, peterz@infradead.org, x86@kernel.org, rppt@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, dan.j.williams@intel.com Cc: elena.reshetova@intel.com, ira.weiny@intel.com, Rick Edgecombe Subject: [PATCH RFC 03/10] module: Use perm_alloc() for modules Date: Fri, 20 Nov 2020 12:24:19 -0800 Message-Id: <20201120202426.18009-4-rick.p.edgecombe@intel.com> In-Reply-To: <20201120202426.18009-1-rick.p.edgecombe@intel.com> References: <20201120202426.18009-1-rick.p.edgecombe@intel.com> X-Patchwork-State: RFC Today, modules use a 
single allocation for each of the core and init allocation lifetimes. Subsections of these allocations are broken into different permission types. In order to allow arch breakouts to allocate permissions more efficiently, change the module loader to operate on these as separate allocations. Track these broken-out subsection allocations in struct module_layout. Drop the unneeded frob_() functions previously used for each permission transition. Since transitioning ro_after_init still requires a flush, instead provide module_map() to handle the initial mapping and ro_after_init transition. Similar to how a flag was stuffed into the upper bits of sh_entsize in order to communicate which (core or init) allocation a section should be placed in, create 4 new flags to communicate which permission allocation a subsection should be in. To prevent disturbing any architectures that may have relocation limitations that require subsections to be close together, lay these perm_allocations out of a single module_alloc() allocation by default. This way architectures can work as they did before, but have the opportunity to do things more efficiently if supported. So this should not have any functional change yet. It is just a change to how the different regions of the module allocations are tracked in module.c such that future patches can actually make the regions separate allocations. Signed-off-by: Rick Edgecombe --- include/linux/module.h | 44 ++-- kernel/module.c | 547 +++++++++++++++++++++++++---------------- 2 files changed, 362 insertions(+), 229 deletions(-) diff --git a/include/linux/module.h b/include/linux/module.h index 6264617bab4d..9964f909d879 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -26,6 +26,7 @@ #include #include #include +#include #include #include @@ -318,23 +319,32 @@ struct mod_tree_node { struct latch_tree_node node; }; -struct module_layout { - /* The actual code + data. */ - void *base; - /* Total size.
*/ +struct mod_alloc { + struct perm_allocation *alloc; unsigned int size; - /* The size of the executable code. */ - unsigned int text_size; - /* Size of RO section of the module (text+rodata) */ - unsigned int ro_size; - /* Size of RO after init section */ - unsigned int ro_after_init_size; #ifdef CONFIG_MODULES_TREE_LOOKUP struct mod_tree_node mtn; #endif }; +struct module_layout { + /* + * The start of the allocations for arch's that have + * them as contiguous regions. + */ + void *base; + + /* The actual code + data. */ + struct mod_alloc text; + struct mod_alloc ro; + struct mod_alloc ro_after_init; + struct mod_alloc rw; + + /* Total size. */ + unsigned int size; +}; + #ifdef CONFIG_MODULES_TREE_LOOKUP /* Only touch one cacheline for common rbtree-for-core-layout case. */ #define __module_layout_align ____cacheline_aligned @@ -562,19 +572,25 @@ bool is_module_address(unsigned long addr); bool __is_module_percpu_address(unsigned long addr, unsigned long *can_addr); bool is_module_percpu_address(unsigned long addr); bool is_module_text_address(unsigned long addr); +struct perm_allocation *module_get_allocation(struct module *mod, unsigned long addr); +bool module_perm_alloc(struct module_layout *layout); +void module_perm_free(struct module_layout *layout); static inline bool within_module_core(unsigned long addr, const struct module *mod) { - return (unsigned long)mod->core_layout.base <= addr && - addr < (unsigned long)mod->core_layout.base + mod->core_layout.size; + return within_perm_alloc(mod->core_layout.text.alloc, addr) || + within_perm_alloc(mod->core_layout.ro.alloc, addr) || + within_perm_alloc(mod->core_layout.ro_after_init.alloc, addr) || + within_perm_alloc(mod->core_layout.rw.alloc, addr); } static inline bool within_module_init(unsigned long addr, const struct module *mod) { - return (unsigned long)mod->init_layout.base <= addr && - addr < (unsigned long)mod->init_layout.base + mod->init_layout.size; + return 
within_perm_alloc(mod->init_layout.text.alloc, addr) || + within_perm_alloc(mod->init_layout.ro.alloc, addr) || + within_perm_alloc(mod->init_layout.rw.alloc, addr); } static inline bool within_module(unsigned long addr, const struct module *mod) diff --git a/kernel/module.c b/kernel/module.c index a4fa44a652a7..0b31c44798e2 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -80,6 +80,12 @@ /* If this is set, the section belongs in the init part of the module */ #define INIT_OFFSET_MASK (1UL << (BITS_PER_LONG-1)) +#define TEXT_OFFSET_MASK (1UL << (BITS_PER_LONG-2)) +#define RO_OFFSET_MASK (1UL << (BITS_PER_LONG-3)) +#define RO_AFTER_INIT_OFFSET_MASK (1UL << (BITS_PER_LONG-4)) +#define RW_OFFSET_MASK (1UL << (BITS_PER_LONG-5)) +#define ALL_OFFSET_MASK (INIT_OFFSET_MASK | TEXT_OFFSET_MASK | RO_OFFSET_MASK | \ + RO_AFTER_INIT_OFFSET_MASK | RW_OFFSET_MASK) /* * Mutex protects: @@ -109,16 +115,16 @@ static LLIST_HEAD(init_free_list); static __always_inline unsigned long __mod_tree_val(struct latch_tree_node *n) { - struct module_layout *layout = container_of(n, struct module_layout, mtn.node); + struct mod_alloc *modalloc = container_of(n, struct mod_alloc, mtn.node); - return (unsigned long)layout->base; + return perm_alloc_address(modalloc->alloc); } static __always_inline unsigned long __mod_tree_size(struct latch_tree_node *n) { - struct module_layout *layout = container_of(n, struct module_layout, mtn.node); + struct mod_alloc *modalloc = container_of(n, struct mod_alloc, mtn.node); - return (unsigned long)layout->size; + return (unsigned long)modalloc->size; } static __always_inline bool @@ -170,29 +176,51 @@ static void __mod_tree_remove(struct mod_tree_node *node) latch_tree_erase(&node->node, &mod_tree.root, &mod_tree_ops); } +static void __mod_tree_insert_layout(struct module_layout *layout, struct module *mod) +{ + layout->text.mtn.mod = mod; + layout->ro.mtn.mod = mod; + layout->ro_after_init.mtn.mod = mod; + layout->rw.mtn.mod = mod; + + if 
(layout->text.alloc) + __mod_tree_insert(&layout->text.mtn); + if (layout->ro.alloc) + __mod_tree_insert(&layout->ro.mtn); + if (layout->ro_after_init.alloc) + __mod_tree_insert(&layout->ro_after_init.mtn); + if (layout->rw.alloc) + __mod_tree_insert(&layout->rw.mtn); +} + +static void __mod_tree_remove_layout(struct module_layout *layout) +{ + __mod_tree_remove(&layout->text.mtn); + __mod_tree_remove(&layout->ro.mtn); + __mod_tree_remove(&layout->ro_after_init.mtn); + __mod_tree_remove(&layout->rw.mtn); +} + /* * These modifications: insert, remove_init and remove; are serialized by the * module_mutex. */ static void mod_tree_insert(struct module *mod) { - mod->core_layout.mtn.mod = mod; - mod->init_layout.mtn.mod = mod; - - __mod_tree_insert(&mod->core_layout.mtn); + __mod_tree_insert_layout(&mod->core_layout, mod); if (mod->init_layout.size) - __mod_tree_insert(&mod->init_layout.mtn); + __mod_tree_insert_layout(&mod->init_layout, mod); } static void mod_tree_remove_init(struct module *mod) { if (mod->init_layout.size) - __mod_tree_remove(&mod->init_layout.mtn); + __mod_tree_remove_layout(&mod->init_layout); } static void mod_tree_remove(struct module *mod) { - __mod_tree_remove(&mod->core_layout.mtn); + __mod_tree_remove_layout(&mod->core_layout); mod_tree_remove_init(mod); } @@ -234,10 +262,16 @@ static struct module *mod_find(unsigned long addr) * Bounds of module text, for speeding up __module_address. * Protected by module_mutex. 
*/ -static void __mod_update_bounds(void *base, unsigned int size) +static void __mod_update_bounds(struct mod_alloc *modalloc) { - unsigned long min = (unsigned long)base; - unsigned long max = min + size; + unsigned long min; + unsigned long max; + + if (!modalloc->alloc) + return; + + min = (unsigned long)perm_alloc_address(modalloc->alloc); + max = min + perm_alloc_size(modalloc->alloc); if (min < module_addr_min) module_addr_min = min; @@ -247,9 +281,14 @@ static void __mod_update_bounds(void *base, unsigned int size) static void mod_update_bounds(struct module *mod) { - __mod_update_bounds(mod->core_layout.base, mod->core_layout.size); - if (mod->init_layout.size) - __mod_update_bounds(mod->init_layout.base, mod->init_layout.size); + __mod_update_bounds(&mod->core_layout.text); + __mod_update_bounds(&mod->core_layout.ro); + __mod_update_bounds(&mod->core_layout.ro_after_init); + __mod_update_bounds(&mod->core_layout.rw); + __mod_update_bounds(&mod->init_layout.text); + __mod_update_bounds(&mod->init_layout.ro); + __mod_update_bounds(&mod->init_layout.ro_after_init); + __mod_update_bounds(&mod->init_layout.rw); } #ifdef CONFIG_KGDB_KDB @@ -1995,100 +2034,48 @@ static void mod_sysfs_teardown(struct module *mod) mod_sysfs_fini(mod); } -/* - * LKM RO/NX protection: protect module's text/ro-data - * from modification and any data from execution. - * - * General layout of module is: - * [text] [read-only-data] [ro-after-init] [writable data] - * text_size -----^ ^ ^ ^ - * ro_size ------------------------| | | - * ro_after_init_size -----------------------------| | - * size -----------------------------------------------------------| - * - * These values are always page-aligned (as is base) - */ - -/* - * Since some arches are moving towards PAGE_KERNEL module allocations instead - * of PAGE_KERNEL_EXEC, keep frob_text() and module_enable_x() outside of the - * CONFIG_STRICT_MODULE_RWX block below because they are needed regardless of - * whether we are strict. 
- */ -#ifdef CONFIG_ARCH_HAS_STRICT_MODULE_RWX -static void frob_text(const struct module_layout *layout, - int (*set_memory)(unsigned long start, int num_pages)) +#ifdef CONFIG_STRICT_MODULE_RWX +static void map_core_regions(const struct module *mod, + virtual_perm text_perm, + virtual_perm ro_perm, + virtual_perm ro_after_init_perm, + virtual_perm rw_perm) { - BUG_ON((unsigned long)layout->base & (PAGE_SIZE-1)); - BUG_ON((unsigned long)layout->text_size & (PAGE_SIZE-1)); - set_memory((unsigned long)layout->base, - layout->text_size >> PAGE_SHIFT); -} + const struct module_layout *layout = &mod->core_layout; -static void module_enable_x(const struct module *mod) -{ - frob_text(&mod->core_layout, set_memory_x); - frob_text(&mod->init_layout, set_memory_x); + perm_change(layout->text.alloc, text_perm); + perm_change(layout->ro.alloc, ro_perm); + perm_change(layout->ro_after_init.alloc, ro_after_init_perm); + perm_change(layout->rw.alloc, rw_perm); } -#else /* !CONFIG_ARCH_HAS_STRICT_MODULE_RWX */ -static void module_enable_x(const struct module *mod) { } -#endif /* CONFIG_ARCH_HAS_STRICT_MODULE_RWX */ -#ifdef CONFIG_STRICT_MODULE_RWX -static void frob_rodata(const struct module_layout *layout, - int (*set_memory)(unsigned long start, int num_pages)) -{ - BUG_ON((unsigned long)layout->base & (PAGE_SIZE-1)); - BUG_ON((unsigned long)layout->text_size & (PAGE_SIZE-1)); - BUG_ON((unsigned long)layout->ro_size & (PAGE_SIZE-1)); - set_memory((unsigned long)layout->base + layout->text_size, - (layout->ro_size - layout->text_size) >> PAGE_SHIFT); -} -static void frob_ro_after_init(const struct module_layout *layout, - int (*set_memory)(unsigned long start, int num_pages)) +static void map_init_regions(const struct module *mod, + virtual_perm text_perm, virtual_perm ro_perm) { - BUG_ON((unsigned long)layout->base & (PAGE_SIZE-1)); - BUG_ON((unsigned long)layout->ro_size & (PAGE_SIZE-1)); - BUG_ON((unsigned long)layout->ro_after_init_size & (PAGE_SIZE-1)); - 
set_memory((unsigned long)layout->base + layout->ro_size, - (layout->ro_after_init_size - layout->ro_size) >> PAGE_SHIFT); -} + const struct module_layout *layout = &mod->init_layout; -static void frob_writable_data(const struct module_layout *layout, - int (*set_memory)(unsigned long start, int num_pages)) -{ - BUG_ON((unsigned long)layout->base & (PAGE_SIZE-1)); - BUG_ON((unsigned long)layout->ro_after_init_size & (PAGE_SIZE-1)); - BUG_ON((unsigned long)layout->size & (PAGE_SIZE-1)); - set_memory((unsigned long)layout->base + layout->ro_after_init_size, - (layout->size - layout->ro_after_init_size) >> PAGE_SHIFT); + perm_change(layout->text.alloc, text_perm); + perm_change(layout->ro.alloc, ro_perm); } -static void module_enable_ro(const struct module *mod, bool after_init) +static void module_map(const struct module *mod, bool after_init) { - if (!rodata_enabled) - return; + virtual_perm after_init_perm = PERM_R; - set_vm_flush_reset_perms(mod->core_layout.base); - set_vm_flush_reset_perms(mod->init_layout.base); - frob_text(&mod->core_layout, set_memory_ro); + perm_writable_finish(mod->core_layout.text.alloc); + perm_writable_finish(mod->core_layout.ro.alloc); + perm_writable_finish(mod->init_layout.text.alloc); + perm_writable_finish(mod->init_layout.ro.alloc); - frob_rodata(&mod->core_layout, set_memory_ro); - frob_text(&mod->init_layout, set_memory_ro); - frob_rodata(&mod->init_layout, set_memory_ro); + if (!rodata_enabled) + return; - if (after_init) - frob_ro_after_init(&mod->core_layout, set_memory_ro); -} + if (!after_init) + after_init_perm |= PERM_W; -static void module_enable_nx(const struct module *mod) -{ - frob_rodata(&mod->core_layout, set_memory_nx); - frob_ro_after_init(&mod->core_layout, set_memory_nx); - frob_writable_data(&mod->core_layout, set_memory_nx); - frob_rodata(&mod->init_layout, set_memory_nx); - frob_writable_data(&mod->init_layout, set_memory_nx); + map_core_regions(mod, PERM_RX, PERM_R, after_init_perm, PERM_RW); + 
map_init_regions(mod, PERM_RX, PERM_R); } static int module_enforce_rwx_sections(Elf_Ehdr *hdr, Elf_Shdr *sechdrs, @@ -2109,8 +2096,7 @@ static int module_enforce_rwx_sections(Elf_Ehdr *hdr, Elf_Shdr *sechdrs, } #else /* !CONFIG_STRICT_MODULE_RWX */ -static void module_enable_nx(const struct module *mod) { } -static void module_enable_ro(const struct module *mod, bool after_init) {} +static void module_map(const struct module *mod, bool after_init) {} static int module_enforce_rwx_sections(Elf_Ehdr *hdr, Elf_Shdr *sechdrs, char *secstrings, struct module *mod) { @@ -2211,6 +2197,8 @@ void __weak module_arch_freeing_init(struct module *mod) { } +void module_perm_free(struct module_layout *layout); + /* Free a module, remove from lists, etc. */ static void free_module(struct module *mod) { @@ -2252,15 +2240,16 @@ static void free_module(struct module *mod) /* This may be empty, but that's OK */ module_arch_freeing_init(mod); - module_memfree(mod->init_layout.base); + module_perm_free(&mod->init_layout); kfree(mod->args); percpu_modfree(mod); /* Free lock-classes; relies on the preceding sync_rcu(). 
*/ - lockdep_free_key_range(mod->core_layout.base, mod->core_layout.size); + lockdep_free_key_range((void *)perm_alloc_address(mod->core_layout.rw.alloc), + mod->core_layout.rw.size); /* Finally, free the core (containing the module structure) */ - module_memfree(mod->core_layout.base); + module_perm_free(&mod->core_layout); } void *__symbol_get(const char *symbol) @@ -2436,10 +2425,56 @@ static long get_offset(struct module *mod, unsigned int *size, *size += arch_mod_section_prepend(mod, section); ret = ALIGN(*size, sechdr->sh_addralign ?: 1); - *size = ret + sechdr->sh_size; + *size = debug_align(ret + sechdr->sh_size); + return ret; } +static inline unsigned int *get_size_from_layout(struct module_layout *layout, + unsigned long offset_mask) +{ + if (offset_mask & TEXT_OFFSET_MASK) + return &layout->text.size; + if (offset_mask & RO_OFFSET_MASK) + return &layout->ro.size; + if (offset_mask & RO_AFTER_INIT_OFFSET_MASK) + return &layout->ro_after_init.size; + return &layout->rw.size; /*RW_OFFSET_MASK*/ +} + +static inline struct perm_allocation *get_alloc_from_layout(struct module_layout *layout, + unsigned long offset_mask) +{ + if (offset_mask & TEXT_OFFSET_MASK) + return layout->text.alloc; + if (offset_mask & RO_OFFSET_MASK) + return layout->ro.alloc; + if (offset_mask & RO_AFTER_INIT_OFFSET_MASK) + return layout->ro_after_init.alloc; + return layout->rw.alloc; /*RW_OFFSET_MASK*/ +} + +static void set_total_size(struct module_layout *layout) +{ + layout->size = debug_align(layout->text.size + layout->ro.size + + layout->ro_after_init.size + layout->rw.size); +} + +struct perm_allocation *module_get_allocation(struct module *mod, unsigned long addr) +{ + struct perm_allocation *allocs[] = { mod->core_layout.text.alloc, mod->core_layout.ro.alloc, + mod->core_layout.ro_after_init.alloc, + mod->core_layout.rw.alloc, mod->init_layout.text.alloc, + mod->init_layout.ro.alloc, mod->init_layout.rw.alloc }; + int i; + + for (i = 0; i < ARRAY_SIZE(allocs); i++) + if 
(within_perm_alloc(allocs[i], addr)) + return allocs[i]; + + return NULL; +} + /* Lay out the SHF_ALLOC sections in a way not dissimilar to how ld might -- code, read-only data, read-write data, small data. Tally sizes, and place the offsets into sh_entsize fields: high bit means it @@ -2456,6 +2491,7 @@ static void layout_sections(struct module *mod, struct load_info *info) { SHF_WRITE | SHF_ALLOC, ARCH_SHF_SMALL }, { ARCH_SHF_SMALL | SHF_ALLOC, 0 } }; + unsigned int *section_size; unsigned int m, i; for (i = 0; i < info->hdr->e_shnum; i++) @@ -2463,6 +2499,8 @@ static void layout_sections(struct module *mod, struct load_info *info) pr_debug("Core section allocation order:\n"); for (m = 0; m < ARRAY_SIZE(masks); ++m) { + unsigned long offset_mask = (1UL << (BITS_PER_LONG - (m + 2))); + for (i = 0; i < info->hdr->e_shnum; ++i) { Elf_Shdr *s = &info->sechdrs[i]; const char *sname = info->secstrings + s->sh_name; @@ -2472,30 +2510,19 @@ static void layout_sections(struct module *mod, struct load_info *info) || s->sh_entsize != ~0UL || module_init_section(sname)) continue; - s->sh_entsize = get_offset(mod, &mod->core_layout.size, s, i); + section_size = get_size_from_layout(&mod->core_layout, offset_mask); + s->sh_entsize = get_offset(mod, section_size, s, i) | offset_mask; + pr_debug("\t%s\n", sname); } - switch (m) { - case 0: /* executable */ - mod->core_layout.size = debug_align(mod->core_layout.size); - mod->core_layout.text_size = mod->core_layout.size; - break; - case 1: /* RO: text and ro-data */ - mod->core_layout.size = debug_align(mod->core_layout.size); - mod->core_layout.ro_size = mod->core_layout.size; - break; - case 2: /* RO after init */ - mod->core_layout.size = debug_align(mod->core_layout.size); - mod->core_layout.ro_after_init_size = mod->core_layout.size; - break; - case 4: /* whole core */ - mod->core_layout.size = debug_align(mod->core_layout.size); - break; - } } + set_total_size(&mod->core_layout); + pr_debug("Init section allocation 
order:\n"); for (m = 0; m < ARRAY_SIZE(masks); ++m) { + unsigned long offset_mask = (1UL << (BITS_PER_LONG - (m + 2))); + for (i = 0; i < info->hdr->e_shnum; ++i) { Elf_Shdr *s = &info->sechdrs[i]; const char *sname = info->secstrings + s->sh_name; @@ -2505,31 +2532,15 @@ static void layout_sections(struct module *mod, struct load_info *info) || s->sh_entsize != ~0UL || !module_init_section(sname)) continue; - s->sh_entsize = (get_offset(mod, &mod->init_layout.size, s, i) - | INIT_OFFSET_MASK); + + section_size = get_size_from_layout(&mod->init_layout, offset_mask); + s->sh_entsize = get_offset(mod, section_size, s, i); + s->sh_entsize |= INIT_OFFSET_MASK | offset_mask; pr_debug("\t%s\n", sname); } - switch (m) { - case 0: /* executable */ - mod->init_layout.size = debug_align(mod->init_layout.size); - mod->init_layout.text_size = mod->init_layout.size; - break; - case 1: /* RO: text and ro-data */ - mod->init_layout.size = debug_align(mod->init_layout.size); - mod->init_layout.ro_size = mod->init_layout.size; - break; - case 2: - /* - * RO after init doesn't apply to init_layout (only - * core_layout), so it just takes the value of ro_size. - */ - mod->init_layout.ro_after_init_size = mod->init_layout.ro_size; - break; - case 4: /* whole init */ - mod->init_layout.size = debug_align(mod->init_layout.size); - break; - } } + + set_total_size(&mod->init_layout); } static void set_license(struct module *mod, const char *license) @@ -2724,7 +2735,7 @@ static void layout_symtab(struct module *mod, struct load_info *info) /* Put symbol section at end of init part of module. 
*/ symsect->sh_flags |= SHF_ALLOC; - symsect->sh_entsize = get_offset(mod, &mod->init_layout.size, symsect, + symsect->sh_entsize = get_offset(mod, &mod->init_layout.rw.size, symsect, info->index.sym) | INIT_OFFSET_MASK; pr_debug("\t%s\n", info->secstrings + symsect->sh_name); @@ -2742,27 +2753,28 @@ static void layout_symtab(struct module *mod, struct load_info *info) } /* Append room for core symbols at end of core part. */ - info->symoffs = ALIGN(mod->core_layout.size, symsect->sh_addralign ?: 1); - info->stroffs = mod->core_layout.size = info->symoffs + ndst * sizeof(Elf_Sym); - mod->core_layout.size += strtab_size; - info->core_typeoffs = mod->core_layout.size; + info->symoffs = ALIGN(mod->core_layout.rw.size, symsect->sh_addralign ?: 1); + info->stroffs = mod->core_layout.rw.size = info->symoffs + ndst * sizeof(Elf_Sym); + mod->core_layout.rw.size += strtab_size; + info->core_typeoffs = mod->core_layout.rw.size; mod->core_layout.size += ndst * sizeof(char); - mod->core_layout.size = debug_align(mod->core_layout.size); /* Put string table section at end of init part of module. */ strsect->sh_flags |= SHF_ALLOC; - strsect->sh_entsize = get_offset(mod, &mod->init_layout.size, strsect, + strsect->sh_entsize = get_offset(mod, &mod->init_layout.rw.size, strsect, info->index.str) | INIT_OFFSET_MASK; pr_debug("\t%s\n", info->secstrings + strsect->sh_name); /* We'll tack temporary mod_kallsyms on the end. 
*/ - mod->init_layout.size = ALIGN(mod->init_layout.size, - __alignof__(struct mod_kallsyms)); - info->mod_kallsyms_init_off = mod->init_layout.size; - mod->init_layout.size += sizeof(struct mod_kallsyms); - info->init_typeoffs = mod->init_layout.size; - mod->init_layout.size += nsrc * sizeof(char); - mod->init_layout.size = debug_align(mod->init_layout.size); + mod->init_layout.rw.size = ALIGN(mod->init_layout.rw.size, + __alignof__(struct mod_kallsyms)); + info->mod_kallsyms_init_off = mod->init_layout.rw.size; + mod->init_layout.rw.size += sizeof(struct mod_kallsyms); + info->init_typeoffs = mod->init_layout.rw.size; + mod->init_layout.rw.size += nsrc * sizeof(char); + + set_total_size(&mod->core_layout); + set_total_size(&mod->init_layout); } /* @@ -2777,23 +2789,25 @@ static void add_kallsyms(struct module *mod, const struct load_info *info) Elf_Sym *dst; char *s; Elf_Shdr *symsec = &info->sechdrs[info->index.sym]; + void *core_rw = (void *)perm_alloc_address(mod->core_layout.rw.alloc); + void *init_rw = (void *)perm_alloc_address(mod->init_layout.rw.alloc); /* Set up to point into init section. */ - mod->kallsyms = mod->init_layout.base + info->mod_kallsyms_init_off; + mod->kallsyms = init_rw + info->mod_kallsyms_init_off; mod->kallsyms->symtab = (void *)symsec->sh_addr; mod->kallsyms->num_symtab = symsec->sh_size / sizeof(Elf_Sym); /* Make sure we get permanent strtab: don't use info->strtab. */ mod->kallsyms->strtab = (void *)info->sechdrs[info->index.str].sh_addr; - mod->kallsyms->typetab = mod->init_layout.base + info->init_typeoffs; + mod->kallsyms->typetab = init_rw + info->init_typeoffs; /* * Now populate the cut down core kallsyms for after init * and set types up while we still have access to sections. 
*/ - mod->core_kallsyms.symtab = dst = mod->core_layout.base + info->symoffs; - mod->core_kallsyms.strtab = s = mod->core_layout.base + info->stroffs; - mod->core_kallsyms.typetab = mod->core_layout.base + info->core_typeoffs; + mod->core_kallsyms.symtab = dst = core_rw + info->symoffs; + mod->core_kallsyms.strtab = s = core_rw + info->stroffs; + mod->core_kallsyms.typetab = core_rw + info->core_typeoffs; src = mod->kallsyms->symtab; for (ndst = i = 0; i < mod->kallsyms->num_symtab; i++) { mod->kallsyms->typetab[i] = elf_type(src + i, info); @@ -2845,6 +2859,110 @@ bool __weak module_init_section(const char *name) return strstarts(name, ".init"); } +static void set_perm_alloc(struct perm_allocation *alloc, struct vm_struct *area, + unsigned long offset, unsigned long size) +{ + if (!alloc) + return; + + alloc->area = area; + alloc->offset = offset; + alloc->size = size; +} + +/* + * Layout perm_allocs as they would have been by a straight module_alloc call in case any arch's + * care about that. + */ +bool __weak module_perm_alloc(struct module_layout *layout) +{ + void *addr = module_alloc(layout->size); + gfp_t flags = GFP_KERNEL | __GFP_ZERO; + struct vm_struct *area; + + layout->text.alloc = NULL; + layout->ro.alloc = NULL; + layout->ro_after_init.alloc = NULL; + layout->rw.alloc = NULL; + + if (!addr) + return false; + + layout->base = addr; + + if (IS_ENABLED(CONFIG_MMU)) { + area = find_vm_area(addr); + } else { + area = kmalloc(sizeof(*area), flags); + if (!area) { + module_memfree(addr); + return NULL; + } + area->addr = addr; + } + + /* + * This is a bit of an awkward hack. modules.c now expects all of the different permission + * ranges to be in separate allocations, but we want to make this perm_allocation transition + * without disturbing any arch, which may have unknown relocation rules. So to make sure + * there is no functional change, allocate using the arch's module_alloc() and lay fake + * perm_allocations over the top of it. 
This way all the sections would be laid out exactly + * as before struct special_alloc was introduced until an arch adds their own + * module_perm_alloc(). + */ + layout->text.alloc = kmalloc(sizeof(*layout->text.alloc), flags); + layout->ro.alloc = kmalloc(sizeof(*layout->ro.alloc), flags); + layout->ro_after_init.alloc = kmalloc(sizeof(*layout->ro_after_init.alloc), flags); + layout->rw.alloc = kmalloc(sizeof(*layout->rw.alloc), flags); + if (!(layout->text.alloc && layout->ro.alloc && layout->ro_after_init.alloc && + layout->rw.alloc)) { + module_perm_free(layout); + layout->text.alloc = NULL; + layout->ro.alloc = NULL; + layout->ro_after_init.alloc = NULL; + layout->rw.alloc = NULL; + return false; + } + + set_perm_alloc(layout->text.alloc, area, 0, layout->text.size); + set_perm_alloc(layout->ro.alloc, area, layout->text.size, layout->ro.size); + set_perm_alloc(layout->ro_after_init.alloc, area, layout->ro.alloc->offset + + layout->ro.size, layout->ro_after_init.size); + set_perm_alloc(layout->rw.alloc, area, + layout->ro_after_init.alloc->offset + layout->ro_after_init.size, + layout->rw.size); + + set_vm_flush_reset_perms(addr); + + return true; +} + +static void unset_layout_allocs(struct module_layout *layout) +{ + layout->text.alloc = NULL; + layout->ro.alloc = NULL; + layout->ro_after_init.alloc = NULL; + layout->rw.alloc = NULL; +} + +void __weak module_perm_free(struct module_layout *layout) +{ + void *addr; + + if (!layout->text.alloc) + return; + addr = (void *)perm_alloc_address(layout->text.alloc); + + /* Clean up the fake perm_allocation's in a special way */ + kfree(layout->text.alloc); + kfree(layout->ro.alloc); + kfree(layout->ro_after_init.alloc); + kfree(layout->rw.alloc); + unset_layout_allocs(layout); + + module_memfree(addr); +} + bool __weak module_exit_section(const char *name) { return strstarts(name, ".exit"); @@ -3306,54 +3424,52 @@ static int find_module_sections(struct module *mod, struct load_info *info) static int 
move_module(struct module *mod, struct load_info *info) { int i; - void *ptr; - /* Do the allocs. */ - ptr = module_alloc(mod->core_layout.size); /* * The pointer to this block is stored in the module structure * which is inside the block. Just mark it as not being a * leak. */ - kmemleak_not_leak(ptr); - if (!ptr) + if (!module_perm_alloc(&mod->core_layout)) return -ENOMEM; - memset(ptr, 0, mod->core_layout.size); - mod->core_layout.base = ptr; + perm_memset(mod->core_layout.text.alloc, 0xcc); + perm_memset(mod->core_layout.ro.alloc, 0); + perm_memset(mod->core_layout.ro_after_init.alloc, 0); + perm_memset(mod->core_layout.rw.alloc, 0); if (mod->init_layout.size) { - ptr = module_alloc(mod->init_layout.size); /* * The pointer to this block is stored in the module structure * which is inside the block. This block doesn't need to be * scanned as it contains data and code that will be freed * after the module is initialized. */ - kmemleak_ignore(ptr); - if (!ptr) { - module_memfree(mod->core_layout.base); + if (!module_perm_alloc(&mod->init_layout)) return -ENOMEM; - } - memset(ptr, 0, mod->init_layout.size); - mod->init_layout.base = ptr; - } else - mod->init_layout.base = NULL; + perm_memset(mod->init_layout.text.alloc, 0xcc); + perm_memset(mod->init_layout.ro.alloc, 0); + perm_memset(mod->init_layout.rw.alloc, 0); + } else { + unset_layout_allocs(&mod->init_layout); + } /* Transfer each section which specifies SHF_ALLOC */ pr_debug("final section addresses:\n"); for (i = 0; i < info->hdr->e_shnum; i++) { void *dest; + struct perm_allocation *alloc; + Elf_Shdr *shdr = &info->sechdrs[i]; if (!(shdr->sh_flags & SHF_ALLOC)) continue; if (shdr->sh_entsize & INIT_OFFSET_MASK) - dest = mod->init_layout.base - + (shdr->sh_entsize & ~INIT_OFFSET_MASK); + alloc = get_alloc_from_layout(&mod->init_layout, shdr->sh_entsize); else - dest = mod->core_layout.base + shdr->sh_entsize; + alloc = get_alloc_from_layout(&mod->core_layout, shdr->sh_entsize); + dest = (void 
*)perm_alloc_address(alloc) + (shdr->sh_entsize & ~ALL_OFFSET_MASK); if (shdr->sh_type != SHT_NOBITS) memcpy(dest, (void *)shdr->sh_addr, shdr->sh_size); @@ -3414,12 +3530,13 @@ static void flush_module_icache(const struct module *mod) * Do it before processing of module parameters, so the module * can provide parameter accessor functions of its own. */ - if (mod->init_layout.base) - flush_icache_range((unsigned long)mod->init_layout.base, - (unsigned long)mod->init_layout.base - + mod->init_layout.size); - flush_icache_range((unsigned long)mod->core_layout.base, - (unsigned long)mod->core_layout.base + mod->core_layout.size); + if (mod->init_layout.text.alloc) + flush_icache_range((unsigned long)perm_alloc_address(mod->init_layout.text.alloc), + (unsigned long)perm_alloc_address(mod->init_layout.text.alloc) + + mod->init_layout.text.size); + flush_icache_range((unsigned long)perm_alloc_address(mod->core_layout.text.alloc), + (unsigned long)perm_alloc_address(mod->core_layout.text.alloc) + + mod->core_layout.text.size); } int __weak module_frob_arch_sections(Elf_Ehdr *hdr, @@ -3515,8 +3632,8 @@ static void module_deallocate(struct module *mod, struct load_info *info) { percpu_modfree(mod); module_arch_freeing_init(mod); - module_memfree(mod->init_layout.base); - module_memfree(mod->core_layout.base); + module_perm_free(&mod->init_layout); + module_perm_free(&mod->core_layout); } int __weak module_finalize(const Elf_Ehdr *hdr, @@ -3576,7 +3693,7 @@ static void do_mod_ctors(struct module *mod) /* For freeing module_init on success, in case kallsyms traversing */ struct mod_initfree { struct llist_node node; - void *module_init; + struct module_layout layout; }; static void do_free_init(struct work_struct *w) @@ -3590,7 +3707,7 @@ static void do_free_init(struct work_struct *w) llist_for_each_safe(pos, n, list) { initfree = container_of(pos, struct mod_initfree, node); - module_memfree(initfree->module_init); + module_perm_free(&initfree->layout); kfree(initfree); } }
@@ -3606,12 +3723,11 @@ static noinline int do_init_module(struct module *mod) int ret = 0; struct mod_initfree *freeinit; - freeinit = kmalloc(sizeof(*freeinit), GFP_KERNEL); + freeinit = kmalloc(sizeof(*freeinit) * 4, GFP_KERNEL); if (!freeinit) { ret = -ENOMEM; goto fail; } - freeinit->module_init = mod->init_layout.base; /* * We want to find out whether @mod uses async during init. Clear @@ -3659,8 +3775,9 @@ static noinline int do_init_module(struct module *mod) if (!mod->async_probe_requested && (current->flags & PF_USED_ASYNC)) async_synchronize_full(); - ftrace_free_mem(mod, mod->init_layout.base, mod->init_layout.base + - mod->init_layout.size); + ftrace_free_mem(mod, (void *)perm_alloc_address(mod->init_layout.text.alloc), + (void *)perm_alloc_address(mod->init_layout.text.alloc) + + perm_alloc_size(mod->init_layout.text.alloc)); mutex_lock(&module_mutex); /* Drop initial reference. */ module_put(mod); @@ -3669,14 +3786,10 @@ static noinline int do_init_module(struct module *mod) /* Switch to core kallsyms now init is done: kallsyms may be walking! */ rcu_assign_pointer(mod->kallsyms, &mod->core_kallsyms); #endif - module_enable_ro(mod, true); + module_map(mod, true); mod_tree_remove_init(mod); module_arch_freeing_init(mod); - mod->init_layout.base = NULL; - mod->init_layout.size = 0; - mod->init_layout.ro_size = 0; - mod->init_layout.ro_after_init_size = 0; - mod->init_layout.text_size = 0; + /* * We want to free module_init, but be aware that kallsyms may be * walking this with preempt disabled. 
In all the failure paths, we @@ -3690,6 +3803,9 @@ static noinline int do_init_module(struct module *mod) * be cleaned up needs to sync with the queued work - ie * rcu_barrier() */ + freeinit->layout = mod->init_layout; + unset_layout_allocs(&mod->init_layout); + mod->init_layout.size = 0; if (llist_add(&freeinit->node, &init_free_list)) schedule_work(&init_free_wq); @@ -3775,9 +3891,7 @@ static int complete_formation(struct module *mod, struct load_info *info) /* This relies on module_mutex for list integrity. */ module_bug_finalize(info->hdr, info->sechdrs, mod); - module_enable_ro(mod, false); - module_enable_nx(mod); - module_enable_x(mod); + module_map(mod, false); /* Mark state as coming so strong_try_module_get() ignores us, * but kallsyms etc. can see us. */ @@ -4018,7 +4132,8 @@ static int load_module(struct load_info *info, const char __user *uargs, mutex_unlock(&module_mutex); free_module: /* Free lock-classes; relies on the preceding sync_rcu() */ - lockdep_free_key_range(mod->core_layout.base, mod->core_layout.size); + lockdep_free_key_range((void *)perm_alloc_address(mod->core_layout.rw.alloc), + mod->core_layout.rw.size); module_deallocate(mod, info); free_copy: @@ -4110,9 +4225,11 @@ static const char *find_kallsyms_symbol(struct module *mod, /* At worse, next value is at end of module */ if (within_module_init(addr, mod)) - nextval = (unsigned long)mod->init_layout.base+mod->init_layout.text_size; + nextval = perm_alloc_address(mod->init_layout.text.alloc) + + mod->init_layout.text.size; else - nextval = (unsigned long)mod->core_layout.base+mod->core_layout.text_size; + nextval = perm_alloc_address(mod->core_layout.text.alloc) + + mod->core_layout.text.size; bestval = kallsyms_symbol_value(&kallsyms->symtab[best]); @@ -4404,7 +4521,7 @@ static int m_show(struct seq_file *m, void *p) mod->state == MODULE_STATE_COMING ? "Loading" : "Live"); /* Used by oprofile and other similar tools. */ - value = m->private ?
NULL : mod->core_layout.base; + value = m->private ? NULL : (void *)perm_alloc_address(mod->core_layout.text.alloc); seq_printf(m, " 0x%px", value); /* Taints info */ @@ -4524,11 +4641,8 @@ struct module *__module_address(unsigned long addr) module_assert_mutex_or_preempt(); mod = mod_find(addr); - if (mod) { - BUG_ON(!within_module(addr, mod)); - if (mod->state == MODULE_STATE_UNFORMED) - mod = NULL; - } + BUG_ON(mod && !within_module(addr, mod)); + return mod; } @@ -4562,9 +4676,12 @@ struct module *__module_text_address(unsigned long addr) { struct module *mod = __module_address(addr); if (mod) { + void *init_text = (void *)perm_alloc_address(mod->init_layout.text.alloc); + void *core_text = (void *)perm_alloc_address(mod->core_layout.text.alloc); + /* Make sure it's within the text section. */ - if (!within(addr, mod->init_layout.base, mod->init_layout.text_size) - && !within(addr, mod->core_layout.base, mod->core_layout.text_size) + if (!within(addr, init_text, mod->init_layout.text.size) && + !within(addr, core_text, mod->core_layout.text.size)) mod = NULL; } return mod; From patchwork Fri Nov 20 20:24:20 2020 X-Patchwork-Submitter: "Edgecombe, Rick P" X-Patchwork-Id: 11922165
From: Rick Edgecombe To: akpm@linux-foundation.org, jeyu@kernel.org, bpf@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net, luto@kernel.org, dave.hansen@linux.intel.com, peterz@infradead.org, x86@kernel.org, rppt@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, dan.j.williams@intel.com Cc: elena.reshetova@intel.com, ira.weiny@intel.com, Rick Edgecombe Subject: [PATCH RFC 04/10] module: Support separate writable allocation Date: Fri, 20 Nov 2020 12:24:20 -0800 Message-Id: <20201120202426.18009-5-rick.p.edgecombe@intel.com> In-Reply-To: <20201120202426.18009-1-rick.p.edgecombe@intel.com> References: <20201120202426.18009-1-rick.p.edgecombe@intel.com> X-Patchwork-State: RFC The perm_alloc
interface allows architectures to redirect writes intended for an allocation into a separate writable staging area before the mapping is made live. In order to support this, change modules to write to the address provided by perm_writable_addr(). Currently this is the same address as the final allocation, so this patch should not create any functional change yet. To facilitate re-direction to separate writable staging areas, create a helper module_adjust_writable_addr(). This function returns an allocation's writable address if the given address belongs to a module that is in the process of being loaded. If the address does not meet that criterion, it simply returns the address passed in. This helper, while a little heavyweight, will allow callers in upcoming patches to retrieve the writable address without knowing which module is being loaded. Signed-off-by: Rick Edgecombe --- include/linux/module.h | 22 ++++++++++++++++++++++ kernel/module.c | 14 +++++++++----- 2 files changed, 31 insertions(+), 5 deletions(-) diff --git a/include/linux/module.h b/include/linux/module.h index 9964f909d879..32dd22b2a38a 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -576,6 +576,23 @@ struct perm_allocation *module_get_allocation(struct module *mod, unsigned long bool module_perm_alloc(struct module_layout *layout); void module_perm_free(struct module_layout *layout); +static inline void *module_adjust_writable_addr(void *addr) +{ + unsigned long laddr = (unsigned long)addr; + struct module *mod; + + mutex_lock(&module_mutex); + mod = __module_address(laddr); + if (!mod) { + mutex_unlock(&module_mutex); + return addr; + } + mutex_unlock(&module_mutex); + /* The module shouldn't be going away if someone is trying to write to it */ + + return (void *)perm_writable_addr(module_get_allocation(mod, laddr), laddr); +} + static inline bool within_module_core(unsigned long addr, const struct module *mod) { @@ -853,6 +870,11 @@ void
*dereference_module_function_descriptor(struct module *mod, void *ptr) return ptr; } +static inline void *module_adjust_writable_addr(void *addr) +{ + return addr; +} + #endif /* CONFIG_MODULES */ #ifdef CONFIG_SYSFS diff --git a/kernel/module.c b/kernel/module.c index 0b31c44798e2..d0afedd36cea 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -3457,7 +3457,7 @@ static int move_module(struct module *mod, struct load_info *info) /* Transfer each section which specifies SHF_ALLOC */ pr_debug("final section addresses:\n"); for (i = 0; i < info->hdr->e_shnum; i++) { - void *dest, *wdest; struct perm_allocation *alloc; Elf_Shdr *shdr = &info->sechdrs[i]; @@ -3470,9 +3470,10 @@ static int move_module(struct module *mod, struct load_info *info) else alloc = get_alloc_from_layout(&mod->core_layout, shdr->sh_entsize); dest = (void *)perm_alloc_address(alloc) + (shdr->sh_entsize & ~ALL_OFFSET_MASK); + wdest = (void *)perm_writable_addr(alloc, (unsigned long)dest); if (shdr->sh_type != SHT_NOBITS) - memcpy(dest, (void *)shdr->sh_addr, shdr->sh_size); + memcpy(wdest, (void *)shdr->sh_addr, shdr->sh_size); /* Update sh_addr to point to copy in image. */ shdr->sh_addr = (unsigned long)dest; pr_debug("\t0x%lx %s\n", @@ -3645,12 +3646,15 @@ int __weak module_finalize(const Elf_Ehdr *hdr, static int post_relocation(struct module *mod, const struct load_info *info) { + struct exception_table_entry *extable_writ = module_adjust_writable_addr(mod->extable); + void *pcpu = (void *)info->sechdrs[info->index.pcpu].sh_addr; + void *percpu_writ = module_adjust_writable_addr(pcpu); + /* Sort exception table now relocations are done. */ - sort_extable(mod->extable, mod->extable + mod->num_exentries); + sort_extable(extable_writ, extable_writ + mod->num_exentries); /* Copy relocated percpu area over.
*/ - percpu_modcopy(mod, (void *)info->sechdrs[info->index.pcpu].sh_addr, - info->sechdrs[info->index.pcpu].sh_size); + percpu_modcopy(mod, percpu_writ, info->sechdrs[info->index.pcpu].sh_size); /* Setup kallsyms-specific fields. */ add_kallsyms(mod, info); From patchwork Fri Nov 20 20:24:21 2020 X-Patchwork-Submitter: "Edgecombe, Rick P" X-Patchwork-Id: 11922295
From: Rick Edgecombe To: akpm@linux-foundation.org, jeyu@kernel.org, bpf@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net, luto@kernel.org, dave.hansen@linux.intel.com, peterz@infradead.org, x86@kernel.org, rppt@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, dan.j.williams@intel.com Cc: elena.reshetova@intel.com, ira.weiny@intel.com, Rick Edgecombe Subject: [PATCH RFC 05/10] x86/modules: Use real perm_allocations Date: Fri, 20 Nov 2020 12:24:21 -0800 Message-Id: <20201120202426.18009-6-rick.p.edgecombe@intel.com> In-Reply-To: <20201120202426.18009-1-rick.p.edgecombe@intel.com> References: <20201120202426.18009-1-rick.p.edgecombe@intel.com> X-Patchwork-State: RFC x86 relocations require all of the text sections to be within 2GB of each other, so as long as the allocations land somewhere in the module area, relocations can still be applied. Relax the restriction that the regions for a module's core or init area be virtually contiguous. Also, apply relocations at the writable address of the perm_allocation, to support a future implementation that has the writable address in a different allocation.
Signed-off-by: Rick Edgecombe --- arch/x86/kernel/module.c | 84 +++++++++++++++++++++++++++++++++------- 1 file changed, 71 insertions(+), 13 deletions(-) diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c index 34b153cbd4ac..7b369f5ffdb7 100644 --- a/arch/x86/kernel/module.c +++ b/arch/x86/kernel/module.c @@ -85,6 +85,58 @@ void *module_alloc(unsigned long size) return p; } +bool module_perm_alloc(struct module_layout *layout) +{ + unsigned long vstart = MODULES_VADDR + get_module_load_offset(); + unsigned long vend = MODULES_END; + unsigned int text_page_cnt = PAGE_ALIGN(layout->text.size) >> PAGE_SHIFT; + unsigned int ro_page_cnt = PAGE_ALIGN(layout->ro.size) >> PAGE_SHIFT; + unsigned int ro_after_init_page_cnt = PAGE_ALIGN(layout->ro_after_init.size) >> PAGE_SHIFT; + unsigned int rw_page_cnt = PAGE_ALIGN(layout->rw.size) >> PAGE_SHIFT; + + layout->text.alloc = NULL; + layout->ro.alloc = NULL; + layout->ro_after_init.alloc = NULL; + layout->rw.alloc = NULL; + + layout->text.alloc = perm_alloc(vstart, vend, text_page_cnt, PERM_RX); + if (!layout->text.alloc && layout->text.size) + goto out; + + layout->ro.alloc = perm_alloc(vstart, vend, ro_page_cnt, PERM_R); + if (!layout->ro.alloc && layout->ro.size) + goto text_free_out; + + layout->ro_after_init.alloc = perm_alloc(vstart, vend, ro_after_init_page_cnt, PERM_RW); + if (!layout->ro_after_init.alloc && layout->ro_after_init.size) + goto ro_free_out; + + layout->rw.alloc = perm_alloc(vstart, vend, rw_page_cnt, PERM_RW); + if (!layout->rw.alloc && layout->rw.size) + goto ro_after_init_out; + + return true; +ro_after_init_out: + perm_free(layout->ro_after_init.alloc); + layout->ro_after_init.alloc = NULL; +ro_free_out: + perm_free(layout->ro.alloc); + layout->ro.alloc = NULL; +text_free_out: + perm_free(layout->text.alloc); + layout->text.alloc = NULL; +out: + return false; +} + +void module_perm_free(struct module_layout *layout) +{ + perm_free(layout->text.alloc); + perm_free(layout->ro.alloc); + 
perm_free(layout->ro_after_init.alloc); + perm_free(layout->rw.alloc); +} + #ifdef CONFIG_X86_32 int apply_relocate(Elf32_Shdr *sechdrs, const char *strtab, @@ -134,9 +186,11 @@ static int __apply_relocate_add(Elf64_Shdr *sechdrs, void *(*write)(void *dest, const void *src, size_t len)) { unsigned int i; + struct perm_allocation *alloc; Elf64_Rela *rel = (void *)sechdrs[relsec].sh_addr; Elf64_Sym *sym; void *loc; + void *writable_loc; u64 val; DEBUGP("Applying relocate section %u to %u\n", @@ -146,6 +200,9 @@ static int __apply_relocate_add(Elf64_Shdr *sechdrs, loc = (void *)sechdrs[sechdrs[relsec].sh_info].sh_addr + rel[i].r_offset; + alloc = module_get_allocation(me, (unsigned long)loc); + writable_loc = (void *)perm_writable_addr(alloc, (unsigned long)loc); + /* This is the symbol it is referring to. Note that all undefined symbols have been resolved. */ sym = (Elf64_Sym *)sechdrs[symindex].sh_addr @@ -161,40 +218,40 @@ static int __apply_relocate_add(Elf64_Shdr *sechdrs, case R_X86_64_NONE: break; case R_X86_64_64: - if (*(u64 *)loc != 0) + if (*(u64 *)writable_loc != 0) goto invalid_relocation; - write(loc, &val, 8); + write(writable_loc, &val, 8); break; case R_X86_64_32: - if (*(u32 *)loc != 0) + if (*(u32 *)writable_loc != 0) goto invalid_relocation; - write(loc, &val, 4); - if (val != *(u32 *)loc) + write(writable_loc, &val, 4); + if (val != *(u32 *)writable_loc) goto overflow; break; case R_X86_64_32S: - if (*(s32 *)loc != 0) + if (*(s32 *)writable_loc != 0) goto invalid_relocation; - write(loc, &val, 4); - if ((s64)val != *(s32 *)loc) + write(writable_loc, &val, 4); + if ((s64)val != *(s32 *)writable_loc) goto overflow; break; case R_X86_64_PC32: case R_X86_64_PLT32: - if (*(u32 *)loc != 0) + if (*(u32 *)writable_loc != 0) goto invalid_relocation; val -= (u64)loc; - write(loc, &val, 4); + write(writable_loc, &val, 4); #if 0 - if ((s64)val != *(s32 *)loc) + if ((s64)val != *(s32 *)writable_loc) goto overflow; #endif break; case R_X86_64_PC64: - if (*(u64 
*)loc != 0) + if (*(u64 *)writable_loc != 0) goto invalid_relocation; val -= (u64)loc; - write(loc, &val, 8); + write(writable_loc, &val, 8); break; default: pr_err("%s: Unknown rela relocation: %llu\n", @@ -273,6 +330,7 @@ int module_finalize(const Elf_Ehdr *hdr, void *aseg = (void *)alt->sh_addr; apply_alternatives(aseg, aseg + alt->sh_size); } + if (locks && text) { void *lseg = (void *)locks->sh_addr; void *tseg = (void *)text->sh_addr; From patchwork Fri Nov 20 20:24:22 2020 X-Patchwork-Submitter: "Edgecombe, Rick P" X-Patchwork-Id: 11922287
From: Rick Edgecombe To: akpm@linux-foundation.org, jeyu@kernel.org, bpf@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net, luto@kernel.org, dave.hansen@linux.intel.com, peterz@infradead.org, x86@kernel.org, rppt@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, dan.j.williams@intel.com Cc: elena.reshetova@intel.com, ira.weiny@intel.com, Rick Edgecombe Subject: [PATCH RFC 06/10] x86/alternatives: Handle perm_allocs for modules Date: Fri, 20 Nov 2020 12:24:22 -0800 Message-Id: <20201120202426.18009-7-rick.p.edgecombe@intel.com> In-Reply-To: <20201120202426.18009-1-rick.p.edgecombe@intel.com> References: <20201120202426.18009-1-rick.p.edgecombe@intel.com> X-Patchwork-State: RFC Modules being loaded using perm_allocs may have a separate writable address. Handle this case in alternatives for operations called during module loading.
Signed-off-by: Rick Edgecombe --- arch/x86/kernel/alternative.c | 25 ++++++++++++++++--------- 1 file changed, 16 insertions(+), 9 deletions(-) diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 2400ad62f330..e6d8a696540b 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -373,7 +373,7 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start, struct alt_instr *end) { struct alt_instr *a; - u8 *instr, *replacement; + u8 *instr, *writ_instr, *replacement, *writ_replacement; u8 insn_buff[MAX_PATCH_LEN]; DPRINTK("alt table %px, -> %px", start, end); @@ -391,11 +391,13 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start, instr = (u8 *)&a->instr_offset + a->instr_offset; replacement = (u8 *)&a->repl_offset + a->repl_offset; + writ_instr = (u8 *)module_adjust_writable_addr(instr); + writ_replacement = (u8 *)module_adjust_writable_addr(replacement); BUG_ON(a->instrlen > sizeof(insn_buff)); BUG_ON(a->cpuid >= (NCAPINTS + NBUGINTS) * 32); if (!boot_cpu_has(a->cpuid)) { if (a->padlen > 1) - optimize_nops(a, instr); + optimize_nops(a, writ_instr); continue; } @@ -403,13 +405,13 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start, DPRINTK("feat: %d*32+%d, old: (%pS (%px) len: %d), repl: (%px, len: %d), pad: %d", a->cpuid >> 5, a->cpuid & 0x1f, - instr, instr, a->instrlen, + writ_instr, instr, a->instrlen, replacement, a->replacementlen, a->padlen); DUMP_BYTES(instr, a->instrlen, "%px: old_insn: ", instr); DUMP_BYTES(replacement, a->replacementlen, "%px: rpl_insn: ", replacement); - memcpy(insn_buff, replacement, a->replacementlen); + memcpy(insn_buff, writ_replacement, a->replacementlen); insn_buff_sz = a->replacementlen; /* @@ -435,7 +437,7 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start, } DUMP_BYTES(insn_buff, insn_buff_sz, "%px: final_insn: ", instr); - text_poke_early(instr, insn_buff, insn_buff_sz); + 
text_poke_early(writ_instr, insn_buff, insn_buff_sz); } } @@ -496,6 +498,10 @@ void __init_or_module alternatives_smp_module_add(struct module *mod, void *text, void *text_end) { struct smp_alt_module *smp; + void *w_locks = module_adjust_writable_addr(locks); + void *w_locks_end = module_adjust_writable_addr(locks_end); + void *w_text = module_adjust_writable_addr(text); + void *w_text_end = module_adjust_writable_addr(text_end); mutex_lock(&text_mutex); if (!uniproc_patched) @@ -522,7 +528,7 @@ void __init_or_module alternatives_smp_module_add(struct module *mod, list_add_tail(&smp->next, &smp_alt_modules); smp_unlock: - alternatives_smp_unlock(locks, locks_end, text, text_end); + alternatives_smp_unlock(w_locks, w_locks_end, w_text, w_text_end); unlock: mutex_unlock(&text_mutex); } @@ -601,17 +607,18 @@ void __init_or_module apply_paravirt(struct paravirt_patch_site *start, for (p = start; p < end; p++) { unsigned int used; + void *writable = module_adjust_writable_addr(p->instr); BUG_ON(p->len > MAX_PATCH_LEN); /* prep the buffer with the original instructions */ - memcpy(insn_buff, p->instr, p->len); - used = pv_ops.init.patch(p->type, insn_buff, (unsigned long)p->instr, p->len); + memcpy(insn_buff, writable, p->len); + used = pv_ops.init.patch(p->type, insn_buff, (unsigned long)writable, p->len); BUG_ON(used > p->len); /* Pad the rest with nops */ add_nops(insn_buff + used, p->len - used); - text_poke_early(p->instr, insn_buff, p->len); + text_poke_early(writable, insn_buff, p->len); } } extern struct paravirt_patch_site __start_parainstructions[], From patchwork Fri Nov 20 20:24:23 2020 X-Patchwork-Submitter: "Edgecombe, Rick P" X-Patchwork-Id: 11922293
From: Rick Edgecombe To: akpm@linux-foundation.org, jeyu@kernel.org, bpf@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net, luto@kernel.org, dave.hansen@linux.intel.com, peterz@infradead.org, x86@kernel.org, rppt@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, dan.j.williams@intel.com
Cc: elena.reshetova@intel.com, ira.weiny@intel.com, Rick Edgecombe Subject: [PATCH RFC 07/10] x86/unwind: Unwind orc at module writable address Date: Fri, 20 Nov 2020 12:24:23 -0800 Message-Id: <20201120202426.18009-8-rick.p.edgecombe@intel.com> In-Reply-To: <20201120202426.18009-1-rick.p.edgecombe@intel.com> References: <20201120202426.18009-1-rick.p.edgecombe@intel.com> X-Patchwork-State: RFC Since modules can have a separate writable address during loading, do the orc unwind at the writable address. Signed-off-by: Rick Edgecombe --- arch/x86/kernel/unwind_orc.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c index 73f800100066..41f9022a10cc 100644 --- a/arch/x86/kernel/unwind_orc.c +++ b/arch/x86/kernel/unwind_orc.c @@ -238,8 +238,8 @@ static int orc_sort_cmp(const void *_a, const void *_b) void unwind_module_init(struct module *mod, void *_orc_ip, size_t orc_ip_size, void *_orc, size_t orc_size) { - int *orc_ip = _orc_ip; - struct orc_entry *orc = _orc; + int *orc_ip = module_adjust_writable_addr(_orc_ip); + struct orc_entry *orc = module_adjust_writable_addr(_orc); unsigned int num_entries = orc_ip_size / sizeof(int); WARN_ON_ONCE(orc_ip_size % sizeof(int) != 0 || @@ -257,8 +257,8 @@ void unwind_module_init(struct module *mod, void *_orc_ip, size_t orc_ip_size, sort(orc_ip, num_entries, sizeof(int), orc_sort_cmp, orc_sort_swap); mutex_unlock(&sort_mutex); - mod->arch.orc_unwind_ip = orc_ip; - mod->arch.orc_unwind = orc; + mod->arch.orc_unwind_ip = _orc_ip; + mod->arch.orc_unwind = _orc; mod->arch.num_orcs = num_entries; } #endif From patchwork Fri Nov 20 20:24:24 2020 X-Patchwork-Submitter: "Edgecombe, Rick P" X-Patchwork-Id: 11922291
From: Rick Edgecombe To: akpm@linux-foundation.org, jeyu@kernel.org, bpf@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net, luto@kernel.org,
dave.hansen@linux.intel.com, peterz@infradead.org, x86@kernel.org, rppt@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, dan.j.williams@intel.com Cc: elena.reshetova@intel.com, ira.weiny@intel.com, Rick Edgecombe Subject: [PATCH RFC 08/10] jump_label: Handle module writable address Date: Fri, 20 Nov 2020 12:24:24 -0800 Message-Id: <20201120202426.18009-9-rick.p.edgecombe@intel.com> In-Reply-To: <20201120202426.18009-1-rick.p.edgecombe@intel.com> References: <20201120202426.18009-1-rick.p.edgecombe@intel.com> X-Patchwork-State: RFC Since modules can have a separate writable address during loading, apply the nops at the writable address. As long as it is known that the operation is happening during a module load, don't do a full text_poke() when writing to the writable address. Signed-off-by: Rick Edgecombe --- arch/x86/kernel/jump_label.c | 18 ++++++++++++++++-- kernel/jump_label.c | 2 +- 2 files changed, 17 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c index 5ba8477c2cb7..7a50148a63dd 100644 --- a/arch/x86/kernel/jump_label.c +++ b/arch/x86/kernel/jump_label.c @@ -63,6 +63,18 @@ static inline void __jump_label_transform(struct jump_entry *entry, int init) { const void *opcode = __jump_label_set_jump_code(entry, type, init); + unsigned long addr = jump_entry_code(entry); + struct module *mod = __module_address(addr); + bool mod_writable = false; + + if (mod) { + struct perm_allocation *alloc = module_get_allocation(mod, addr); + + if (perm_is_writable(alloc)) { + addr = perm_writable_addr(alloc, addr); + mod_writable = true; + } + } /* * As long as only a single processor is running and the code is still @@ -74,9 +86,11 @@ static inline void __jump_label_transform(struct jump_entry *entry, * At the time the change is being done, just ignore whether we * are doing nop
-> jump or jump -> nop transition, and assume * always nop being the 'currently valid' instruction + * + * If this is a module being loaded, text_poke_early can also be used. */ - if (init || system_state == SYSTEM_BOOTING) { - text_poke_early((void *)jump_entry_code(entry), opcode, + if (init || system_state == SYSTEM_BOOTING || mod_writable) { + text_poke_early((void *)addr, opcode, JUMP_LABEL_NOP_SIZE); return; } diff --git a/kernel/jump_label.c b/kernel/jump_label.c index 015ef903ce8c..3919e78fce12 100644 --- a/kernel/jump_label.c +++ b/kernel/jump_label.c @@ -595,7 +595,7 @@ static void __jump_label_mod_update(struct static_key *key) */ void jump_label_apply_nops(struct module *mod) { - struct jump_entry *iter_start = mod->jump_entries; + struct jump_entry *iter_start = module_adjust_writable_addr(mod->jump_entries); struct jump_entry *iter_stop = iter_start + mod->num_jump_entries; struct jump_entry *iter; From patchwork Fri Nov 20 20:24:25 2020 X-Patchwork-Submitter: "Edgecombe, Rick P" X-Patchwork-Id: 11922289 From: Rick Edgecombe Subject: [PATCH RFC 09/10] ftrace: Use module writable address Date: Fri, 20 Nov 2020 12:24:25 -0800 Message-Id: <20201120202426.18009-10-rick.p.edgecombe@intel.com> In-Reply-To: <20201120202426.18009-1-rick.p.edgecombe@intel.com> References: <20201120202426.18009-1-rick.p.edgecombe@intel.com> X-Patchwork-State: RFC Use the module writable address to accommodate arches that have a separate writable address for perm_alloc.
Signed-off-by: Rick Edgecombe --- kernel/trace/ftrace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 8185f7240095..ea6377108bab 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -6134,7 +6134,7 @@ static int ftrace_process_locs(struct module *mod, if (!count) return 0; - sort(start, count, sizeof(*start), + sort(module_adjust_writable_addr(start), count, sizeof(*start), ftrace_cmp_ips, NULL); start_pg = ftrace_allocate_pages(count); From patchwork Fri Nov 20 20:24:26 2020 X-Patchwork-Submitter: "Edgecombe, Rick P" X-Patchwork-Id: 11922285 From: Rick Edgecombe Subject: [PATCH RFC 10/10] vmalloc: Add perm_alloc x86 implementation Date: Fri, 20 Nov 2020 12:24:26 -0800 Message-Id: <20201120202426.18009-11-rick.p.edgecombe@intel.com> In-Reply-To: <20201120202426.18009-1-rick.p.edgecombe@intel.com> References: <20201120202426.18009-1-rick.p.edgecombe@intel.com> X-Patchwork-State: RFC Create a custom perm_alloc() implementation for x86. Allocate a new buffer to be returned by perm_writable_addr(), and text_poke() it into the final allocation in perm_writable_finish(). Map the allocation at its final permissions so that it does not need to be shot down as it transitions from writable to read-only. Since x86 direct map permissions are set via set_memory_() calls whose alias mappings can be scattered across the direct map, draw instead from blocks of 2MB unmapped pages.
Previously the direct map was marked RO for an RO vmalloc alias; unmapping the direct map instead allows the pages to be remapped later without a shootdown. Register a shrinker callback so this pool of pages can be freed under memory pressure. This should reduce the breaking of large pages on the direct map compared to the existing set_memory_()-based solutions. Since the pages are left as 4k NP pages, they can be remapped in the shrinker callback without any split or flush. An algorithm for how many unmapped pages to keep in the cache is still a TODO; currently the cache grows until the shrinker triggers. By grabbing whole 2MB pages this may reduce the availability of high page order blocks in some cases. However, it could just as easily improve the availability of high order pages for vmalloc allocations that are long lived, such as modules, by grouping these long lived allocations together. Caching pages does, however, reduce the pages available to other parts of the kernel for use in other caches.
Signed-off-by: Rick Edgecombe --- arch/x86/Kconfig | 1 + arch/x86/include/asm/set_memory.h | 2 + arch/x86/mm/Makefile | 1 + arch/x86/mm/pat/set_memory.c | 13 + arch/x86/mm/vmalloc.c | 438 ++++++++++++++++++++++++++++++ 5 files changed, 455 insertions(+) create mode 100644 arch/x86/mm/vmalloc.c diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index f6946b81f74a..aa3c19fa23f0 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -78,6 +78,7 @@ config X86 select ARCH_HAS_COPY_MC if X86_64 select ARCH_HAS_SET_MEMORY select ARCH_HAS_SET_DIRECT_MAP + select ARCH_HAS_PERM_ALLOC_IMPLEMENTATION if X86_64 select ARCH_HAS_STRICT_KERNEL_RWX select ARCH_HAS_STRICT_MODULE_RWX select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h index 5948218f35c5..0d3efed009bb 100644 --- a/arch/x86/include/asm/set_memory.h +++ b/arch/x86/include/asm/set_memory.h @@ -49,6 +49,8 @@ int set_memory_decrypted(unsigned long addr, int numpages); int set_memory_np_noalias(unsigned long addr, int numpages); int set_memory_nonglobal(unsigned long addr, int numpages); int set_memory_global(unsigned long addr, int numpages); +int set_memory_noalias_noflush(unsigned long addr, int numpages, + pgprot_t mask_set, pgprot_t mask_clr); int set_pages_array_uc(struct page **pages, int addrinarray); int set_pages_array_wc(struct page **pages, int addrinarray); diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile index 5864219221ca..fb7a566bfedc 100644 --- a/arch/x86/mm/Makefile +++ b/arch/x86/mm/Makefile @@ -35,6 +35,7 @@ obj-$(CONFIG_PTDUMP_CORE) += dump_pagetables.o obj-$(CONFIG_PTDUMP_DEBUGFS) += debug_pagetables.o obj-$(CONFIG_HIGHMEM) += highmem_32.o +obj-$(CONFIG_ARCH_HAS_PERM_ALLOC_IMPLEMENTATION) += vmalloc.o KASAN_SANITIZE_kasan_init_$(BITS).o := n obj-$(CONFIG_KASAN) += kasan_init_$(BITS).o diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index 40baa90e74f4..59f42de1e6ab 100644 --- 
a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -1954,6 +1954,19 @@ int set_memory_np_noalias(unsigned long addr, int numpages) cpa_flags, NULL); } +int set_memory_noalias_noflush(unsigned long addr, int numpages, + pgprot_t mask_set, pgprot_t mask_clr) +{ + struct cpa_data cpa = { .vaddr = &addr, + .pgd = NULL, + .numpages = numpages, + .mask_set = mask_set, + .mask_clr = mask_clr, + .flags = CPA_NO_CHECK_ALIAS}; + + return __change_page_attr_set_clr(&cpa, 0); +} + int set_memory_4k(unsigned long addr, int numpages) { return change_page_attr_set_clr(&addr, numpages, __pgprot(0), diff --git a/arch/x86/mm/vmalloc.c b/arch/x86/mm/vmalloc.c new file mode 100644 index 000000000000..2bc6381cbd2d --- /dev/null +++ b/arch/x86/mm/vmalloc.c @@ -0,0 +1,438 @@ +// SPDX-License-Identifier: GPL-2.0-only + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define DIRECTMAP_PROTECTED 0 +#define DIRECTMAP_UNPROTECTED 1 + +static LIST_HEAD(pages); +static int page_cnt; +static DEFINE_SPINLOCK(pages_lock); + +#define DEFINE_RANGE_CALC(x) struct range_calc x = {.min = ULONG_MAX, .max = 0} + +struct range_calc { + unsigned long min; + unsigned long max; +}; + +static inline void grow_range(struct range_calc *grower, unsigned long addr, unsigned long size) +{ + if (addr < grower->min) + grower->min = addr; + if (addr + size > grower->max) + grower->max = addr + size; +} + +static inline void grow_range_page(struct range_calc *grower, struct page *page) +{ + unsigned long addr = (unsigned long)page_address(page); + + if (!addr) + return; + + grow_range(grower, addr, PAGE_SIZE); +} + +static inline bool has_range(struct range_calc *grower) +{ + return grower->min != ULONG_MAX && grower->max != 0; +} + +static unsigned long perm_shrink_count(struct shrinker *shrinker, struct shrink_control *sc) +{ + unsigned long ret; + + spin_lock(&pages_lock); + ret = page_cnt; + spin_unlock(&pages_lock); + + return ret; +} + +static struct page
*__get_perm_page(void) +{ + struct page *page; + + spin_lock(&pages_lock); + page = list_first_entry_or_null(&pages, struct page, lru); + if (page) { + list_del(&page->lru); + page_cnt--; + } + spin_unlock(&pages_lock); + + return page; +} + +static unsigned long perm_shrink_scan(struct shrinker *shrinker, struct shrink_control *sc) +{ + DEFINE_RANGE_CALC(range); + LIST_HEAD(to_free_list); + struct page *page, *tmp; + unsigned int i, cnt = 0; + + for (i = 0; i < sc->nr_to_scan; i++) { + page = __get_perm_page(); + if (!page) + break; + + grow_range_page(&range, page); + set_direct_map_default_noflush(page); + list_add(&page->lru, &to_free_list); + cnt++; + } + + if (has_range(&range)) + flush_tlb_kernel_range(range.min, range.max); + + list_for_each_entry_safe(page, tmp, &to_free_list, lru) + __free_pages(page, 0); + + return cnt; +} + +static struct shrinker perm_shrinker = { + .count_objects = perm_shrink_count, + .scan_objects = perm_shrink_scan, + .seeks = DEFAULT_SEEKS +}; + +static bool replenish_pages_one(void) +{ + struct page *page = alloc_pages(GFP_KERNEL | __GFP_ZERO, 0); + + if (!page) + return false; + + spin_lock(&pages_lock); + list_add(&page->lru, &pages); + page->private = DIRECTMAP_UNPROTECTED; + page_cnt++; + spin_unlock(&pages_lock); + + return true; +} + +static bool replenish_pages(void) +{ + struct page *page = alloc_pages(GFP_KERNEL | __GFP_ZERO, 9); /* 2MB */ + DEFINE_RANGE_CALC(range); + int convert_ret = 0; + int i; + + if (!page) + return replenish_pages_one(); + + for (i = 0; i < 512; i++) + convert_ret |= set_direct_map_invalid_noflush(&page[i]); + + if (convert_ret) + goto convert_fail; + + spin_lock(&pages_lock); + for (i = 0; i < 512; i++) { + list_add(&page[i].lru, &pages); + page[i].private = DIRECTMAP_PROTECTED; + grow_range_page(&range, &page[i]); + } + page_cnt += 512; + spin_unlock(&pages_lock); + + flush_tlb_kernel_range(range.min, range.max); + + vm_unmap_aliases(); + + return true; + +convert_fail: + for (i = 0; i < 512; i++) { + 
set_direct_map_default_noflush(&page[i]); + __free_pages(&page[i], 0); + } + + return false; +} + +static struct page *get_perm_page(void) +{ + struct page *page; + + spin_lock(&pages_lock); + if (!page_cnt) { + spin_unlock(&pages_lock); + if (!replenish_pages()) + return NULL; + spin_lock(&pages_lock); + } + + page = list_first_entry(&pages, struct page, lru); + page_cnt--; + list_del(&page->lru); + spin_unlock(&pages_lock); + + return page; +} + +static void __perm_free_page(struct page *page) +{ + spin_lock(&pages_lock); + list_add(&page->lru, &pages); + page_cnt++; + spin_unlock(&pages_lock); +} + +static void __perm_free_pages(struct page **page, int count) +{ + int i; + + spin_lock(&pages_lock); + for (i = 0; i < count; i++) + list_add(&page[i]->lru, &pages); + page_cnt += count; + spin_unlock(&pages_lock); +} + +static inline pgprot_t perms_to_prot(virtual_perm perms) +{ + switch (perms) { + case PERM_RX: + return PAGE_KERNEL_ROX; + case PERM_RWX: + return PAGE_KERNEL_EXEC; + case PERM_RW: + return PAGE_KERNEL; + case PERM_R: + return PAGE_KERNEL_RO; + default: + return __pgprot(0); + } +} + +static bool map_alloc(struct perm_allocation *alloc) +{ + return !map_kernel_range(perm_alloc_address(alloc), get_vm_area_size(alloc->area), + perms_to_prot(alloc->cur_perm), alloc->pages); +} + +struct perm_allocation *perm_alloc(unsigned long vstart, unsigned long vend, unsigned long page_cnt, + virtual_perm perms) +{ + struct perm_allocation *alloc; + DEFINE_RANGE_CALC(range); + int i, j; + + if (!page_cnt) + return NULL; + + alloc = kmalloc(sizeof(*alloc), GFP_KERNEL | __GFP_ZERO); + + if (!alloc) + return NULL; + + alloc->area = __get_vm_area_caller(page_cnt << PAGE_SHIFT, VM_MAP, vstart, vend, + __builtin_return_address(0)); + + if (!alloc->area) + goto free_alloc; + + alloc->pages = kvmalloc(page_cnt * sizeof(struct page *), GFP_KERNEL); + if (!alloc->pages) + goto free_area; + + alloc->size = (unsigned long)get_vm_area_size(alloc->area); + alloc->offset = 0; + 
alloc->writable = NULL; + alloc->cur_perm = perms; + + /* TODO if this will be RW, we don't need unmapped pages, better to conserve those */ + for (i = 0; i < page_cnt; i++) { + alloc->pages[i] = get_perm_page(); + if (!alloc->pages[i]) + goto free_pages; + if (alloc->pages[i]->private == DIRECTMAP_PROTECTED) + continue; + + grow_range_page(&range, alloc->pages[i]); + if (set_direct_map_invalid_noflush(alloc->pages[i])) + goto convert_fail; + alloc->pages[i]->private = DIRECTMAP_PROTECTED; + } + + /* + * Flush any pages that were removed in the loop above. In the event of no pages in the + * cache, these may be scattered about single pages, so flush here to only have a single + * flush instead of one for each replenish_pages_one() call. + */ + if (has_range(&range)) { + flush_tlb_kernel_range(range.min, range.max); + vm_unmap_aliases(); + } + + /* TODO: Need to zero these pages */ + if (!map_alloc(alloc)) + goto free_pages; + + return alloc; + +free_pages: + __perm_free_pages(alloc->pages, i); + kvfree(alloc->pages); +free_area: + remove_vm_area(alloc->area->addr); +free_alloc: + kfree(alloc); + + return NULL; + +convert_fail: + for (j = 0; j <= i; j++) + __perm_free_page(alloc->pages[j]); + kvfree(alloc->pages); + goto free_area; +} + +unsigned long perm_writable_addr(struct perm_allocation *alloc, unsigned long addr) +{ + /* If this is already mapped and writable, just write to the actual kva */ + if (alloc->cur_perm & PERM_W) + return addr; + + /* TODO lock or callers need to synchronize?
*/ + /* TODO could fail, set primary alias writable in this case or alloc up front */ + if (!alloc->writable) + alloc->writable = vmalloc(alloc->size); + + return (unsigned long)alloc->writable + (addr - perm_alloc_address(alloc)); +} + +bool perm_writable_finish(struct perm_allocation *alloc) +{ + unsigned long addr; + void *writable; + int i; + + if (!alloc) + return false; + if (!alloc->writable) + return true; + + addr = perm_alloc_address(alloc); + + writable = (void *)perm_writable_base(alloc); + + mutex_lock(&text_mutex); + for (i = 0; i < perm_alloc_size(alloc); i += PAGE_SIZE) + text_poke((void *)(addr + i), writable + i, PAGE_SIZE); + mutex_unlock(&text_mutex); + + vfree(alloc->writable); + + alloc->writable = NULL; + + return true; +} + +static inline pgprot_t get_set(virtual_perm perms) +{ + pteval_t ret = 0; + + if (perms) + ret = _PAGE_PRESENT; + + if (perms & PERM_W) + ret |= _PAGE_RW; + + if (~perms & PERM_X) + ret |= _PAGE_NX; + + return __pgprot(ret); +} + +static inline pgprot_t get_unset(virtual_perm perms) +{ + pteval_t ret = 0; + + if (!perms) + ret = _PAGE_PRESENT; + + if (~perms & PERM_W) + ret |= _PAGE_RW; + + if (perms & PERM_X) + ret |= _PAGE_NX; + + return __pgprot(ret); +} + +bool perm_change(struct perm_allocation *alloc, virtual_perm perms) +{ + pgprot_t set = get_set(perms); + pgprot_t clr = get_unset(perms); + unsigned long start; + + if (!alloc) + return false; + + if (!(alloc->cur_perm ^ perms)) + return true; + + start = perm_alloc_address(alloc); + + if (set_memory_noalias_noflush(start, perm_alloc_size(alloc) >> PAGE_SHIFT, set, clr)) + return false; + + flush_tlb_kernel_range(start, start + perm_alloc_size(alloc)); + + alloc->cur_perm = perms; + + return true; +} + +static inline bool perms_need_flush(struct perm_allocation *alloc) +{ + return alloc->cur_perm & PERM_X; +} + +void perm_free(struct perm_allocation *alloc) +{ + unsigned long size, addr; + int page_cnt; + + if (!alloc) + return; + + if (alloc->writable) + vfree(alloc->writable); + + size = 
get_vm_area_size(alloc->area); + addr = perm_alloc_address(alloc); + page_cnt = size >> PAGE_SHIFT; + + vunmap((void *)addr); + + if (perms_need_flush(alloc)) + flush_tlb_kernel_range(addr, addr + size); + + __perm_free_pages(alloc->pages, page_cnt); + + kvfree(alloc->pages); + kfree(alloc); +} + +void perm_memset(struct perm_allocation *alloc, char val) +{ + if (!alloc) + return; + + memset((void *)perm_writable_base(alloc), val, perm_alloc_size(alloc)); +} + +static int __init perm_alloc_init(void) +{ + return register_shrinker(&perm_shrinker); +} +device_initcall(perm_alloc_init);