From patchwork Wed Feb 23 05:21:37 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756361 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C1CF1C433EF for ; Wed, 23 Feb 2022 05:23:53 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237727AbiBWFYS (ORCPT ); Wed, 23 Feb 2022 00:24:18 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56418 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235058AbiBWFYQ (ORCPT ); Wed, 23 Feb 2022 00:24:16 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E5C5669294 for ; Tue, 22 Feb 2022 21:23:48 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id l3-20020a25ad43000000b0062462e2af34so11455926ybe.17 for ; Tue, 22 Feb 2022 21:23:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=GiCelUYVwIstxpHcG9rjUpGG5AWgwrhv320gd9YieGc=; b=m37cSIhex1P0AZECSya53Nwlg/yln3Q85Hc/7yVpLCT4G9GGxthOi0alDY9kzflfJf jxr7dI9kLc34Zh2tgIIu7ncsA8mItJVunH8yTxcyUS0nT8UjkaLLgEbCfjeQThYH++9U X5/yPZI6EEWT1qhiFMvs4k6thovqNzSAhAgs+b0wZLFQZwXp2Q22vOzrG2juvEle09q1 lF2qYsQdKi3eJhFxiXSLgstSyfZd9tAOiBs2kmdCVwg1XZsE0rPkNGxamGQwsj1qmYfa 3N/pDdrSsLFF0KDU8Yzh8eFsdN7hYEX4OK/Dt3o9A1u0TvL2rmxVKhEMmpc6wiFqQO4Y WIOg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=GiCelUYVwIstxpHcG9rjUpGG5AWgwrhv320gd9YieGc=; b=UNfaNswsYR9U5wzp8Y4vhOQ6+VvF1M1gYyL1HaSbWBgOhg/7lnP/DmbEDCdcZCJpBG T/9A0FfSKju6GgxNjxkmaOc11jVwLnegx+FwI8moeqATJpRaZm7Zq+s858ObTHrmEc/M 0OmduNpgzVi8NS2NmXKvxhFg/ZL0l9cAm4FFyvy6Uv1UFtJnI9NEWIoKKoLOh3zXgHow veOeyQ31TUv2W6ucisMf1XM75SEPlKdq12FlND8Qp9Z2CAfAEHVv8zYrqjZ6rwlpbWYh S/u6fQP/N7M1eGr23ZTfRzEJ4VCc+1u2EybnjSFH6HluwM7F6qH8J5ESTh6u2Sr8Wup3 yPSw== X-Gm-Message-State: AOAM531D7HF0UolFW39A1fj9e920AkUjteS5kf0vqzEY+R+mdole3Nup 8SaqawiY+7fvc+7q7F10EIWXByt15bzM X-Google-Smtp-Source: ABdhPJymjY3mQJBHUfMhfZ8vJdOZZMoFyZQ84NUDUQbSn/nngiTOsENeiNKk8wsCIb50BwzPk44T8MOnbWIc X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a25:2b0a:0:b0:624:a898:3e2f with SMTP id r10-20020a252b0a000000b00624a8983e2fmr9721548ybr.643.1645593828179; Tue, 22 Feb 2022 21:23:48 -0800 (PST) Date: Tue, 22 Feb 2022 21:21:37 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-2-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 01/47] mm: asi: Introduce ASI core API From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Introduce core API for Address 
Space Isolation (ASI). Kernel address space isolation provides the ability to run some kernel code with a reduced kernel address space. There can be multiple classes of such restricted kernel address spaces (e.g. KPTI, KVM-PTI etc.). Each ASI class is identified by an index. The ASI class can register some hooks to be called when entering/exiting the restricted address space. Currently, there is a fixed maximum number of ASI classes supported. In addition, each process can have at most one restricted address space from each ASI class. Neither of these are inherent limitations and are merely simplifying assumptions for the time being. (The Kconfig and the high-level ASI API are derived from the original ASI RFC by Alexandre Chartre). Originally-by: Alexandre Chartre Signed-off-by: Junaid Shahid --- arch/alpha/include/asm/Kbuild | 1 + arch/arc/include/asm/Kbuild | 1 + arch/arm/include/asm/Kbuild | 1 + arch/arm64/include/asm/Kbuild | 1 + arch/csky/include/asm/Kbuild | 1 + arch/h8300/include/asm/Kbuild | 1 + arch/hexagon/include/asm/Kbuild | 1 + arch/ia64/include/asm/Kbuild | 1 + arch/m68k/include/asm/Kbuild | 1 + arch/microblaze/include/asm/Kbuild | 1 + arch/mips/include/asm/Kbuild | 1 + arch/nds32/include/asm/Kbuild | 1 + arch/nios2/include/asm/Kbuild | 1 + arch/openrisc/include/asm/Kbuild | 1 + arch/parisc/include/asm/Kbuild | 1 + arch/powerpc/include/asm/Kbuild | 1 + arch/riscv/include/asm/Kbuild | 1 + arch/s390/include/asm/Kbuild | 1 + arch/sh/include/asm/Kbuild | 1 + arch/sparc/include/asm/Kbuild | 1 + arch/um/include/asm/Kbuild | 1 + arch/x86/include/asm/asi.h | 81 +++++++++++++++ arch/x86/include/asm/tlbflush.h | 2 + arch/x86/mm/Makefile | 1 + arch/x86/mm/asi.c | 152 +++++++++++++++++++++++++++++ arch/x86/mm/init.c | 5 +- arch/x86/mm/tlb.c | 2 +- arch/xtensa/include/asm/Kbuild | 1 + include/asm-generic/asi.h | 51 ++++++++++ include/linux/mm_types.h | 3 + kernel/fork.c | 3 + security/Kconfig | 10 ++ 32 files changed, 329 insertions(+), 3 deletions(-) create mode 100644 arch/x86/include/asm/asi.h create mode 100644 arch/x86/mm/asi.c create mode 100644 include/asm-generic/asi.h diff --git a/arch/alpha/include/asm/Kbuild b/arch/alpha/include/asm/Kbuild index 42911c8340c7..e3cd063d9cca 100644 --- a/arch/alpha/include/asm/Kbuild +++ b/arch/alpha/include/asm/Kbuild @@ -4,3 +4,4 @@ generated-y += syscall_table.h generic-y += export.h generic-y += kvm_para.h generic-y += mcs_spinlock.h +generic-y += asi.h diff --git a/arch/arc/include/asm/Kbuild b/arch/arc/include/asm/Kbuild index 3c1afa524b9c..60bdeffa7c31 100644 --- a/arch/arc/include/asm/Kbuild +++ b/arch/arc/include/asm/Kbuild @@ -4,3 +4,4 @@ generic-y += kvm_para.h generic-y += mcs_spinlock.h generic-y += parport.h generic-y += user.h +generic-y += asi.h diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild index 03657ff8fbe3..1e2c3d8dbbd9 100644 --- a/arch/arm/include/asm/Kbuild +++ b/arch/arm/include/asm/Kbuild @@ -6,3 +6,4 @@ generic-y += parport.h generated-y += mach-types.h generated-y += unistd-nr.h +generic-y += asi.h diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild index 64202010b700..086e94f00f94 100644 --- a/arch/arm64/include/asm/Kbuild +++ b/arch/arm64/include/asm/Kbuild @@ -4,5 +4,6 @@ generic-y += mcs_spinlock.h generic-y += qrwlock.h generic-y += qspinlock.h generic-y += user.h +generic-y += asi.h generated-y += cpucaps.h diff --git a/arch/csky/include/asm/Kbuild b/arch/csky/include/asm/Kbuild index 904a18a818be..b4af49fa48c3 100644 --- a/arch/csky/include/asm/Kbuild +++ 
b/arch/csky/include/asm/Kbuild @@ -6,3 +6,4 @@ generic-y += kvm_para.h generic-y += qrwlock.h generic-y += user.h generic-y += vmlinux.lds.h +generic-y += asi.h diff --git a/arch/h8300/include/asm/Kbuild b/arch/h8300/include/asm/Kbuild index e23139c8fc0d..f1e937df4c8e 100644 --- a/arch/h8300/include/asm/Kbuild +++ b/arch/h8300/include/asm/Kbuild @@ -6,3 +6,4 @@ generic-y += kvm_para.h generic-y += mcs_spinlock.h generic-y += parport.h generic-y += spinlock.h +generic-y += asi.h diff --git a/arch/hexagon/include/asm/Kbuild b/arch/hexagon/include/asm/Kbuild index 3ece3c93fe08..744ffbeeb7ae 100644 --- a/arch/hexagon/include/asm/Kbuild +++ b/arch/hexagon/include/asm/Kbuild @@ -3,3 +3,4 @@ generic-y += extable.h generic-y += iomap.h generic-y += kvm_para.h generic-y += mcs_spinlock.h +generic-y += asi.h diff --git a/arch/ia64/include/asm/Kbuild b/arch/ia64/include/asm/Kbuild index f994c1daf9d4..897a388f3e85 100644 --- a/arch/ia64/include/asm/Kbuild +++ b/arch/ia64/include/asm/Kbuild @@ -3,3 +3,4 @@ generated-y += syscall_table.h generic-y += kvm_para.h generic-y += mcs_spinlock.h generic-y += vtime.h +generic-y += asi.h diff --git a/arch/m68k/include/asm/Kbuild b/arch/m68k/include/asm/Kbuild index 0dbf9c5c6fae..faf0f135df4a 100644 --- a/arch/m68k/include/asm/Kbuild +++ b/arch/m68k/include/asm/Kbuild @@ -4,3 +4,4 @@ generic-y += extable.h generic-y += kvm_para.h generic-y += mcs_spinlock.h generic-y += spinlock.h +generic-y += asi.h diff --git a/arch/microblaze/include/asm/Kbuild b/arch/microblaze/include/asm/Kbuild index a055f5dbe00a..012e4bf83c13 100644 --- a/arch/microblaze/include/asm/Kbuild +++ b/arch/microblaze/include/asm/Kbuild @@ -8,3 +8,4 @@ generic-y += parport.h generic-y += syscalls.h generic-y += tlb.h generic-y += user.h +generic-y += asi.h diff --git a/arch/mips/include/asm/Kbuild b/arch/mips/include/asm/Kbuild index dee172716581..b2c7b62536b4 100644 --- a/arch/mips/include/asm/Kbuild +++ b/arch/mips/include/asm/Kbuild @@ -14,3 +14,4 @@ generic-y += parport.h generic-y += qrwlock.h generic-y += qspinlock.h generic-y += user.h +generic-y += asi.h diff --git a/arch/nds32/include/asm/Kbuild b/arch/nds32/include/asm/Kbuild index 82a4453c9c2d..e8c4cf63db79 100644 --- a/arch/nds32/include/asm/Kbuild +++ b/arch/nds32/include/asm/Kbuild @@ -6,3 +6,4 @@ generic-y += gpio.h generic-y += kvm_para.h generic-y += parport.h generic-y += user.h +generic-y += asi.h diff --git a/arch/nios2/include/asm/Kbuild b/arch/nios2/include/asm/Kbuild index 7fe7437555fb..bfdc4026c5b1 100644 --- a/arch/nios2/include/asm/Kbuild +++ b/arch/nios2/include/asm/Kbuild @@ -5,3 +5,4 @@ generic-y += kvm_para.h generic-y += mcs_spinlock.h generic-y += spinlock.h generic-y += user.h +generic-y += asi.h diff --git a/arch/openrisc/include/asm/Kbuild b/arch/openrisc/include/asm/Kbuild index ca5987e11053..3d365bec74d0 100644 --- a/arch/openrisc/include/asm/Kbuild +++ b/arch/openrisc/include/asm/Kbuild @@ -7,3 +7,4 @@ generic-y += qspinlock.h generic-y += qrwlock_types.h generic-y += qrwlock.h generic-y += user.h +generic-y += asi.h diff --git a/arch/parisc/include/asm/Kbuild b/arch/parisc/include/asm/Kbuild index e6e7f74c8ac9..b14e4f727331 100644 --- a/arch/parisc/include/asm/Kbuild +++ b/arch/parisc/include/asm/Kbuild @@ -4,3 +4,4 @@ generated-y += syscall_table_64.h generic-y += kvm_para.h generic-y += mcs_spinlock.h generic-y += user.h +generic-y += asi.h diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild index bcf95ce0964f..2aff0fa469c4 100644 --- a/arch/powerpc/include/asm/Kbuild +++ 
b/arch/powerpc/include/asm/Kbuild @@ -8,3 +8,4 @@ generic-y += mcs_spinlock.h generic-y += qrwlock.h generic-y += vtime.h generic-y += early_ioremap.h +generic-y += asi.h diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild index 445ccc97305a..3e2022a5a6c5 100644 --- a/arch/riscv/include/asm/Kbuild +++ b/arch/riscv/include/asm/Kbuild @@ -5,3 +5,4 @@ generic-y += flat.h generic-y += kvm_para.h generic-y += user.h generic-y += vmlinux.lds.h +generic-y += asi.h diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild index 1a18d7b82f86..ef80906ed195 100644 --- a/arch/s390/include/asm/Kbuild +++ b/arch/s390/include/asm/Kbuild @@ -8,3 +8,4 @@ generic-y += asm-offsets.h generic-y += export.h generic-y += kvm_types.h generic-y += mcs_spinlock.h +generic-y += asi.h diff --git a/arch/sh/include/asm/Kbuild b/arch/sh/include/asm/Kbuild index fc44d9c88b41..ea19e4515828 100644 --- a/arch/sh/include/asm/Kbuild +++ b/arch/sh/include/asm/Kbuild @@ -3,3 +3,4 @@ generated-y += syscall_table.h generic-y += kvm_para.h generic-y += mcs_spinlock.h generic-y += parport.h +generic-y += asi.h diff --git a/arch/sparc/include/asm/Kbuild b/arch/sparc/include/asm/Kbuild index 0b9d98ced34a..08730a26aaed 100644 --- a/arch/sparc/include/asm/Kbuild +++ b/arch/sparc/include/asm/Kbuild @@ -4,3 +4,4 @@ generated-y += syscall_table_64.h generic-y += export.h generic-y += kvm_para.h generic-y += mcs_spinlock.h +generic-y += asi.h diff --git a/arch/um/include/asm/Kbuild b/arch/um/include/asm/Kbuild index e5a7b552bb38..b62245b2445a 100644 --- a/arch/um/include/asm/Kbuild +++ b/arch/um/include/asm/Kbuild @@ -27,3 +27,4 @@ generic-y += word-at-a-time.h generic-y += kprobes.h generic-y += mm_hooks.h generic-y += vga.h +generic-y += asi.h diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h new file mode 100644 index 000000000000..f9fc928a555d --- /dev/null +++ b/arch/x86/include/asm/asi.h @@ -0,0 +1,81 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_X86_ASI_H +#define _ASM_X86_ASI_H + +#include + +#include +#include + +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + +#define ASI_MAX_NUM_ORDER 2 +#define ASI_MAX_NUM (1 << ASI_MAX_NUM_ORDER) + +struct asi_state { + struct asi *curr_asi; + struct asi *target_asi; +}; + +struct asi_hooks { + /* Both of these functions MUST be idempotent and re-entrant. 
*/ + + void (*post_asi_enter)(void); + void (*pre_asi_exit)(void); +}; + +struct asi_class { + struct asi_hooks ops; + uint flags; + const char *name; +}; + +struct asi { + pgd_t *pgd; + struct asi_class *class; + struct mm_struct *mm; +}; + +DECLARE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state); + +void asi_init_mm_state(struct mm_struct *mm); + +int asi_register_class(const char *name, uint flags, + const struct asi_hooks *ops); +void asi_unregister_class(int index); + +int asi_init(struct mm_struct *mm, int asi_index); +void asi_destroy(struct asi *asi); + +void asi_enter(struct asi *asi); +void asi_exit(void); + +static inline void asi_set_target_unrestricted(void) +{ + barrier(); + this_cpu_write(asi_cpu_state.target_asi, NULL); +} + +static inline struct asi *asi_get_current(void) +{ + return this_cpu_read(asi_cpu_state.curr_asi); +} + +static inline struct asi *asi_get_target(void) +{ + return this_cpu_read(asi_cpu_state.target_asi); +} + +static inline bool is_asi_active(void) +{ + return (bool)asi_get_current(); +} + +static inline bool asi_is_target_unrestricted(void) +{ + return !asi_get_target(); +} + +#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */ + +#endif diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index b587a9ee9cb2..3c43ad46c14a 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -259,6 +259,8 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch, extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch); +unsigned long build_cr3(pgd_t *pgd, u16 asid); + #endif /* !MODULE */ #endif /* _ASM_X86_TLBFLUSH_H */ diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile index 5864219221ca..09d5e65e47c8 100644 --- a/arch/x86/mm/Makefile +++ b/arch/x86/mm/Makefile @@ -51,6 +51,7 @@ obj-$(CONFIG_NUMA_EMU) += numa_emulation.o obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o +obj-$(CONFIG_ADDRESS_SPACE_ISOLATION) += asi.o obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt.o obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c new file mode 100644 index 000000000000..9928325f3787 --- /dev/null +++ b/arch/x86/mm/asi.c @@ -0,0 +1,152 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include + +#undef pr_fmt +#define pr_fmt(fmt) "ASI: " fmt + +static struct asi_class asi_class[ASI_MAX_NUM]; +static DEFINE_SPINLOCK(asi_class_lock); + +DEFINE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state); +EXPORT_PER_CPU_SYMBOL_GPL(asi_cpu_state); + +int asi_register_class(const char *name, uint flags, + const struct asi_hooks *ops) +{ + int i; + + VM_BUG_ON(name == NULL); + + spin_lock(&asi_class_lock); + + for (i = 1; i < ASI_MAX_NUM; i++) { + if (asi_class[i].name == NULL) { + asi_class[i].name = name; + asi_class[i].flags = flags; + if (ops != NULL) + asi_class[i].ops = *ops; + break; + } + } + + spin_unlock(&asi_class_lock); + + if (i == ASI_MAX_NUM) + i = -ENOSPC; + + return i; +} +EXPORT_SYMBOL_GPL(asi_register_class); + +void asi_unregister_class(int index) +{ + spin_lock(&asi_class_lock); + + WARN_ON(asi_class[index].name == NULL); + memset(&asi_class[index], 0, sizeof(struct asi_class)); + + spin_unlock(&asi_class_lock); +} +EXPORT_SYMBOL_GPL(asi_unregister_class); + +int asi_init(struct mm_struct *mm, int asi_index) +{ + struct asi *asi = &mm->asi[asi_index]; + + /* Index 0 is reserved for special purposes. 
*/ + WARN_ON(asi_index == 0 || asi_index >= ASI_MAX_NUM); + WARN_ON(asi->pgd != NULL); + + /* + * For now, we allocate 2 pages to avoid any potential problems with + * KPTI code. This won't be needed once KPTI is folded into the ASI + * framework. + */ + asi->pgd = (pgd_t *)__get_free_pages(GFP_PGTABLE_USER, + PGD_ALLOCATION_ORDER); + if (!asi->pgd) + return -ENOMEM; + + asi->class = &asi_class[asi_index]; + asi->mm = mm; + + return 0; +} +EXPORT_SYMBOL_GPL(asi_init); + +void asi_destroy(struct asi *asi) +{ + free_pages((ulong)asi->pgd, PGD_ALLOCATION_ORDER); + memset(asi, 0, sizeof(struct asi)); +} +EXPORT_SYMBOL_GPL(asi_destroy); + +static void __asi_enter(void) +{ + u64 asi_cr3; + struct asi *target = this_cpu_read(asi_cpu_state.target_asi); + + VM_BUG_ON(preemptible()); + + if (!target || target == this_cpu_read(asi_cpu_state.curr_asi)) + return; + + VM_BUG_ON(this_cpu_read(cpu_tlbstate.loaded_mm) == + LOADED_MM_SWITCHING); + + this_cpu_write(asi_cpu_state.curr_asi, target); + + asi_cr3 = build_cr3(target->pgd, + this_cpu_read(cpu_tlbstate.loaded_mm_asid)); + write_cr3(asi_cr3); + + if (target->class->ops.post_asi_enter) + target->class->ops.post_asi_enter(); +} + +void asi_enter(struct asi *asi) +{ + VM_WARN_ON_ONCE(!asi); + + this_cpu_write(asi_cpu_state.target_asi, asi); + barrier(); + + __asi_enter(); +} +EXPORT_SYMBOL_GPL(asi_enter); + +void asi_exit(void) +{ + u64 unrestricted_cr3; + struct asi *asi; + + preempt_disable(); + + VM_BUG_ON(this_cpu_read(cpu_tlbstate.loaded_mm) == + LOADED_MM_SWITCHING); + + asi = this_cpu_read(asi_cpu_state.curr_asi); + + if (asi) { + if (asi->class->ops.pre_asi_exit) + asi->class->ops.pre_asi_exit(); + + unrestricted_cr3 = + build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd, + this_cpu_read(cpu_tlbstate.loaded_mm_asid)); + + write_cr3(unrestricted_cr3); + this_cpu_write(asi_cpu_state.curr_asi, NULL); + } + + preempt_enable(); +} +EXPORT_SYMBOL_GPL(asi_exit); + +void asi_init_mm_state(struct mm_struct *mm) +{ + memset(mm->asi, 0, sizeof(mm->asi)); +} diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c index 1895986842b9..000cbe5315f5 100644 --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -238,8 +238,9 @@ static void __init probe_page_size_mask(void) /* By the default is everything supported: */ __default_kernel_pte_mask = __supported_pte_mask; - /* Except when with PTI where the kernel is mostly non-Global: */ - if (cpu_feature_enabled(X86_FEATURE_PTI)) + /* Except when with PTI or ASI where the kernel is mostly non-Global: */ + if (cpu_feature_enabled(X86_FEATURE_PTI) || + IS_ENABLED(CONFIG_ADDRESS_SPACE_ISOLATION)) __default_kernel_pte_mask &= ~_PAGE_GLOBAL; /* Enable 1 GB linear kernel mappings if available: */ diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 59ba2968af1b..88d9298720dc 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -153,7 +153,7 @@ static inline u16 user_pcid(u16 asid) return ret; } -static inline unsigned long build_cr3(pgd_t *pgd, u16 asid) +inline unsigned long build_cr3(pgd_t *pgd, u16 asid) { if (static_cpu_has(X86_FEATURE_PCID)) { return __sme_pa(pgd) | kern_pcid(asid); diff --git a/arch/xtensa/include/asm/Kbuild b/arch/xtensa/include/asm/Kbuild index 854c5e07e867..49fcdf9d83f5 100644 --- a/arch/xtensa/include/asm/Kbuild +++ b/arch/xtensa/include/asm/Kbuild @@ -7,3 +7,4 @@ generic-y += param.h generic-y += qrwlock.h generic-y += qspinlock.h generic-y += user.h +generic-y += asi.h diff --git a/include/asm-generic/asi.h b/include/asm-generic/asi.h new file mode 100644 index 
000000000000..e5ba51d30b90 --- /dev/null +++ b/include/asm-generic/asi.h @@ -0,0 +1,51 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __ASM_GENERIC_ASI_H +#define __ASM_GENERIC_ASI_H + +/* ASI class flags */ +#define ASI_MAP_STANDARD_NONSENSITIVE 1 + +#ifndef CONFIG_ADDRESS_SPACE_ISOLATION + +#define ASI_MAX_NUM_ORDER 0 +#define ASI_MAX_NUM 0 + +#ifndef _ASSEMBLY_ + +struct asi_hooks {}; +struct asi {}; + +static inline +int asi_register_class(const char *name, uint flags, + const struct asi_hooks *ops) +{ + return 0; +} + +static inline void asi_unregister_class(int asi_index) { } + +static inline void asi_init_mm_state(struct mm_struct *mm) { } + +static inline int asi_init(struct mm_struct *mm, int asi_index) { return 0; } + +static inline void asi_destroy(struct asi *asi) { } + +static inline void asi_enter(struct asi *asi) { } + +static inline void asi_set_target_unrestricted(void) { } + +static inline bool asi_is_target_unrestricted(void) { return true; } + +static inline void asi_exit(void) { } + +static inline bool is_asi_active(void) { return false; } + +static inline struct asi *asi_get_target(void) { return NULL; } + +static inline struct asi *asi_get_current(void) { return NULL; } + +#endif /* !_ASSEMBLY_ */ + +#endif /* !CONFIG_ADDRESS_SPACE_ISOLATION */ + +#endif diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index c3a6e6209600..3de1afa57289 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -18,6 +18,7 @@ #include #include +#include #ifndef AT_VECTOR_SIZE_ARCH #define AT_VECTOR_SIZE_ARCH 0 @@ -495,6 +496,8 @@ struct mm_struct { atomic_t membarrier_state; #endif + struct asi asi[ASI_MAX_NUM]; + /** * @mm_users: The number of users including userspace. * diff --git a/kernel/fork.c b/kernel/fork.c index 3244cc56b697..3695a32ee9bd 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -102,6 +102,7 @@ #include #include #include +#include #include @@ -1071,6 +1072,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, mm->def_flags = 0; } + asi_init_mm_state(mm); + if (mm_alloc_pgd(mm)) goto fail_nopgd; diff --git a/security/Kconfig b/security/Kconfig index 0b847f435beb..21b15ecaf2c1 100644 --- a/security/Kconfig +++ b/security/Kconfig @@ -65,6 +65,16 @@ config PAGE_TABLE_ISOLATION See Documentation/x86/pti.rst for more details. +config ADDRESS_SPACE_ISOLATION + bool "Allow code to run with a reduced kernel address space" + default n + depends on X86_64 && !UML + depends on !PARAVIRT + help + This feature provides the ability to run some kernel code + with a reduced kernel address space. This can be used to + mitigate some speculative execution attacks. 
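To make the calling sequence of the API introduced by this patch concrete, here is a minimal, hypothetical user of an ASI class (sketch only, not part of the series; names prefixed example_ are illustrative and error handling is abbreviated):

#include <linux/init.h>
#include <linux/mm_types.h>
#include <linux/preempt.h>
#include <asm/asi.h>

static void example_post_enter(void)
{
	/* Runs after CR3 has been switched to the restricted page tables. */
}

static void example_pre_exit(void)
{
	/* Runs just before CR3 is switched back to the full kernel tables. */
}

/* Per the comment in asi.h above, both hooks must be idempotent and re-entrant. */
static const struct asi_hooks example_hooks = {
	.post_asi_enter = example_post_enter,
	.pre_asi_exit   = example_pre_exit,
};

static int example_index;

static int __init example_register(void)
{
	/* Returns an index in [1, ASI_MAX_NUM) on success, or -ENOSPC. */
	example_index = asi_register_class("example", 0, &example_hooks);
	return example_index < 0 ? example_index : 0;
}

static int example_run_restricted(struct mm_struct *mm)
{
	struct asi *asi = &mm->asi[example_index];
	int err;

	/* Allocate the restricted page tables for this mm. */
	err = asi_init(mm, example_index);
	if (err)
		return err;

	/* __asi_enter() asserts !preemptible(), so disable preemption first. */
	preempt_disable();
	asi_enter(asi);
	/* ... code that runs with the reduced address space ... */
	asi_set_target_unrestricted();
	asi_exit();
	preempt_enable();

	asi_destroy(asi);
	return 0;
}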
+ config SECURITY_INFINIBAND bool "Infiniband Security Hooks" depends on SECURITY && INFINIBAND From patchwork Wed Feb 23 05:21:38 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756408 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id BF683C4332F for ; Wed, 23 Feb 2022 05:23:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237826AbiBWFYV (ORCPT ); Wed, 23 Feb 2022 00:24:21 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56440 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237725AbiBWFYS (ORCPT ); Wed, 23 Feb 2022 00:24:18 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 37904692AD for ; Tue, 22 Feb 2022 21:23:51 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id w1-20020a05690204e100b006244315a721so15100815ybs.0 for ; Tue, 22 Feb 2022 21:23:51 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=w8tfMjRpTHBn0meL0Ch0aZ7WXP9rbXxYNEO8B63hsno=; b=davxLdvboa7AnosMDU4YZaonHOXc2mWvmXUYJD5HKzin8o36/qY50HjsjugGaI/D5l Ru80WqGe/FnLjt+8HnlGtPjt4Zi6YBsphfykkNdK7xrLMzwOVzFL31vs2n/ZZ0O/HnpI LOgi7W6aQ+4Zmn4Z1cOc+jbXsnCew4XJAl85YB5PRtJ03TFKlpmUqcnmAJ1Qg9xnXjcb 4yfUNHBaaQoJYVar5RgqzCzuUxniA1Jjald3axQoDsLvTOuO9FNS+ARVJ1i5nxUMdGko 1OO6tdOVImcBTLT1zhrx+dYQEI4jcgxd4WQZmhlrRvv7dBLxB4zea6BzNV1mQ4MyCUUD fnQg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=w8tfMjRpTHBn0meL0Ch0aZ7WXP9rbXxYNEO8B63hsno=; b=y2krTEA+q3DurwCqr4WWU2pNRH8HHWvKDTbdt0SAanwmgUse7HOnjVCcp1cfQ9f0rF 76g7/X76C4lILHcV/n8SQVA0uUsrXHQpSMMtmo3bZ86wbXD81q4JXkhe7YX5GcfwOwsX J5eH4CeU2eLMBkhNohv7V2qkcGwTIqmpciwSMe68jLhlv6JA3HosrM/2+XdM2AUD1QB6 UuEsHpwI1ajG1+XUdldPaZ69Y51T0A90ydmhhYvsFz2LG3EF17ee+Kiq2cN2FGx2Yyag lBE61AnmntIldxFms1PJwiw4z2fK6XM176CoxwMe6SAPsCFYW6ZZCmHFtJGCN5VQse3i MlTA== X-Gm-Message-State: AOAM532p1+LWAv5XO2Gpw4qJ76jtyJgI0pRYIBkr2Nnte7USUGfNBEmu Qzy3HnZ2NyWzIbTboZc0ESwLKjd2irhE X-Google-Smtp-Source: ABdhPJxbYNkojrkaLjILQ8n9XN/i4xd/c0tZZOEcEinUXddidcFfUTlIca2QMtLlpZBc7xT9Yey+F1gwDIpI X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a25:2551:0:b0:613:2017:b879 with SMTP id l78-20020a252551000000b006132017b879mr26593133ybl.557.1645593830476; Tue, 22 Feb 2022 21:23:50 -0800 (PST) Date: Tue, 22 Feb 2022 21:21:38 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-3-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 02/47] mm: asi: Add command-line parameter to enable/disable ASI From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, 
luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org A parameter named "asi" is added, disabled by default. A feature flag X86_FEATURE_ASI is set if ASI is enabled. Signed-off-by: Junaid Shahid --- arch/x86/include/asm/asi.h | 17 ++++++++++---- arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/include/asm/disabled-features.h | 8 ++++++- arch/x86/mm/asi.c | 29 ++++++++++++++++++++++++ arch/x86/mm/init.c | 2 +- include/asm-generic/asi.h | 2 ++ 6 files changed, 53 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index f9fc928a555d..0a4af23ed0eb 100644 --- a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -6,6 +6,7 @@ #include #include +#include #ifdef CONFIG_ADDRESS_SPACE_ISOLATION @@ -52,18 +53,24 @@ void asi_exit(void); static inline void asi_set_target_unrestricted(void) { - barrier(); - this_cpu_write(asi_cpu_state.target_asi, NULL); + if (static_cpu_has(X86_FEATURE_ASI)) { + barrier(); + this_cpu_write(asi_cpu_state.target_asi, NULL); + } } static inline struct asi *asi_get_current(void) { - return this_cpu_read(asi_cpu_state.curr_asi); + return static_cpu_has(X86_FEATURE_ASI) + ? this_cpu_read(asi_cpu_state.curr_asi) + : NULL; } static inline struct asi *asi_get_target(void) { - return this_cpu_read(asi_cpu_state.target_asi); + return static_cpu_has(X86_FEATURE_ASI) + ? this_cpu_read(asi_cpu_state.target_asi) + : NULL; } static inline bool is_asi_active(void) @@ -76,6 +83,8 @@ static inline bool asi_is_target_unrestricted(void) return !asi_get_target(); } +#define static_asi_enabled() cpu_feature_enabled(X86_FEATURE_ASI) + #endif /* CONFIG_ADDRESS_SPACE_ISOLATION */ #endif diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index d5b5f2ab87a0..0b0ead3cdd48 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -295,6 +295,7 @@ #define X86_FEATURE_PER_THREAD_MBA (11*32+ 7) /* "" Per-thread Memory Bandwidth Allocation */ #define X86_FEATURE_SGX1 (11*32+ 8) /* "" Basic SGX */ #define X86_FEATURE_SGX2 (11*32+ 9) /* "" SGX Enclave Dynamic Memory Management (EDMM) */ +#define X86_FEATURE_ASI (11*32+10) /* Kernel Address Space Isolation */ /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */ #define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */ diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h index 8f28fafa98b3..9659cd9f867d 100644 --- a/arch/x86/include/asm/disabled-features.h +++ b/arch/x86/include/asm/disabled-features.h @@ -56,6 +56,12 @@ # define DISABLE_PTI (1 << (X86_FEATURE_PTI & 31)) #endif +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +# define DISABLE_ASI 0 +#else +# define DISABLE_ASI (1 << (X86_FEATURE_ASI & 31)) +#endif + /* Force disable because it's broken beyond repair */ #define DISABLE_ENQCMD (1 << (X86_FEATURE_ENQCMD & 31)) @@ -79,7 +85,7 @@ #define DISABLED_MASK8 0 #define DISABLED_MASK9 (DISABLE_SMAP|DISABLE_SGX) #define DISABLED_MASK10 0 -#define DISABLED_MASK11 0 +#define DISABLED_MASK11 (DISABLE_ASI) #define DISABLED_MASK12 0 #define DISABLED_MASK13 0 #define DISABLED_MASK14 0 diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index 9928325f3787..d274c86f89b7 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -1,5 +1,7 @@ // SPDX-License-Identifier: GPL-2.0 +#include + #include #include #include @@ -18,6 +20,9 @@ int asi_register_class(const char *name, uint flags, { int i; + if (!boot_cpu_has(X86_FEATURE_ASI)) 
+ return 0; + VM_BUG_ON(name == NULL); spin_lock(&asi_class_lock); @@ -43,6 +48,9 @@ EXPORT_SYMBOL_GPL(asi_register_class); void asi_unregister_class(int index) { + if (!boot_cpu_has(X86_FEATURE_ASI)) + return; + spin_lock(&asi_class_lock); WARN_ON(asi_class[index].name == NULL); @@ -52,10 +60,22 @@ void asi_unregister_class(int index) } EXPORT_SYMBOL_GPL(asi_unregister_class); +static int __init set_asi_param(char *str) +{ + if (strcmp(str, "on") == 0) + setup_force_cpu_cap(X86_FEATURE_ASI); + + return 0; +} +early_param("asi", set_asi_param); + int asi_init(struct mm_struct *mm, int asi_index) { struct asi *asi = &mm->asi[asi_index]; + if (!boot_cpu_has(X86_FEATURE_ASI)) + return 0; + /* Index 0 is reserved for special purposes. */ WARN_ON(asi_index == 0 || asi_index >= ASI_MAX_NUM); WARN_ON(asi->pgd != NULL); @@ -79,6 +99,9 @@ EXPORT_SYMBOL_GPL(asi_init); void asi_destroy(struct asi *asi) { + if (!boot_cpu_has(X86_FEATURE_ASI)) + return; + free_pages((ulong)asi->pgd, PGD_ALLOCATION_ORDER); memset(asi, 0, sizeof(struct asi)); } @@ -109,6 +132,9 @@ static void __asi_enter(void) void asi_enter(struct asi *asi) { + if (!static_cpu_has(X86_FEATURE_ASI)) + return; + VM_WARN_ON_ONCE(!asi); this_cpu_write(asi_cpu_state.target_asi, asi); @@ -123,6 +149,9 @@ void asi_exit(void) u64 unrestricted_cr3; struct asi *asi; + if (!static_cpu_has(X86_FEATURE_ASI)) + return; + preempt_disable(); VM_BUG_ON(this_cpu_read(cpu_tlbstate.loaded_mm) == diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c index 000cbe5315f5..dfff17363365 100644 --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -240,7 +240,7 @@ static void __init probe_page_size_mask(void) __default_kernel_pte_mask = __supported_pte_mask; /* Except when with PTI or ASI where the kernel is mostly non-Global: */ if (cpu_feature_enabled(X86_FEATURE_PTI) || - IS_ENABLED(CONFIG_ADDRESS_SPACE_ISOLATION)) + cpu_feature_enabled(X86_FEATURE_ASI)) __default_kernel_pte_mask &= ~_PAGE_GLOBAL; /* Enable 1 GB linear kernel mappings if available: */ diff --git a/include/asm-generic/asi.h b/include/asm-generic/asi.h index e5ba51d30b90..dae1403ee1d0 100644 --- a/include/asm-generic/asi.h +++ b/include/asm-generic/asi.h @@ -44,6 +44,8 @@ static inline struct asi *asi_get_target(void) { return NULL; } static inline struct asi *asi_get_current(void) { return NULL; } +#define static_asi_enabled() false + #endif /* !_ASSEMBLY_ */ #endif /* !CONFIG_ADDRESS_SPACE_ISOLATION */ From patchwork Wed Feb 23 05:21:39 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756409 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id BD06EC433F5 for ; Wed, 23 Feb 2022 05:24:01 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238021AbiBWFYZ (ORCPT ); Wed, 23 Feb 2022 00:24:25 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56482 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237716AbiBWFYU (ORCPT ); Wed, 23 Feb 2022 00:24:20 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 516EE694A9 for ; Tue, 22 Feb 2022 21:23:53 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id 
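As a usage note for the "asi" parameter added above: booting with "asi=on" force-sets X86_FEATURE_ASI, and code outside the ASI core can key off that flag so ASI-only paths stay out of the way on unconfigured or unbooted-for-ASI systems. A small hypothetical guard (sketch only):

#include <asm/asi.h>

static void example_asi_only_work(void)
{
	/*
	 * X86_FEATURE_ASI is only force-set when the kernel is booted with
	 * "asi=on". With CONFIG_ADDRESS_SPACE_ISOLATION=n, the DISABLE_ASI
	 * mask makes this check compile-time false, so the ASI-only path
	 * disappears entirely.
	 */
	if (!static_asi_enabled())
		return;

	/* ... work that only makes sense when ASI is active ... */
}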
s22-20020a252d56000000b00624652ac3e1so11129304ybe.16 for ; Tue, 22 Feb 2022 21:23:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=o+K8zCWcV59JiWXD4aWine0/FDtELjiPI/RbdPNAHQc=; b=gL62Qc22CZ6FeqYe4RTaen0CJKVsVLhWP6XuIpy8wtIhquKE7Hq2aAgjM0cnFrMx// B9jyIFrBZPnR4ZblulS8U7l5+BX78tipmOGn+uNc7HwBgNSyCtAgwmuxuLupc9yf33fw sG7BRSGiBqlxsiI4E0nPILyGbfSDlu+DOPd9cNUxoIqlK1SF5LShFZNMYNQ5wWc1UUUq JRgPAlM6n7K8F/7k32Qk1plHhKLU8qRnhntMmXKxbAFtecCJ3vVWG8s4ryjAHjKMNW6w HoGY4yajLvjwNYPSxXLVJpFkRwFJsLHr1WAWSZWkzlHrdQQ5OWrVIF2pqwZ0e1qUQPaz mArQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=o+K8zCWcV59JiWXD4aWine0/FDtELjiPI/RbdPNAHQc=; b=Emestt3Ac8bLX/aT9FijF89EvfyV4KCKLvJx/L02KdNWqLPYTYs3Ob6PsGipm43Ht8 oGm5BBfF/yoJWlLwUVaRYQQL7I92C/ycTTo5KTc5hQzPePp3qTDqR9pQ+r5AY/2zmjn1 bzhGQ/7IhY/J0vKwvL81KSKu8T5coKZoRg/Zj7X5Rjtowyu84Dv6ZNbF+ry7CcgNyp1m LAaQIPjPSBY+AhjZwn4U63nVuXUaOhfvXVzU/gEQm2GyILooIboj9wB2g8f4lih86TJE Puz7wdNS4Q4yYyqHtrDXZ83OgjUaOWJNOrBOTToVQ2TLjYfBfpiyHZlE8OLh9Qga3dp3 oKSA== X-Gm-Message-State: AOAM533uBPp4gDBhtqhPjW/uGe656iGFd/huhPTzU8fBRG3+cD/H/rgG ecxFDt9NK447Zfc5RlL4RHXltD4ppskO X-Google-Smtp-Source: ABdhPJyhZCUKb1E/Ab+htpmdItbzaBqk6IfSNwdQ2cMPET/cQ9m/cxN2+Idkkun2NTapLkyPe3v/k6Dz4k/E X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a25:a486:0:b0:61d:a523:acd0 with SMTP id g6-20020a25a486000000b0061da523acd0mr25432547ybi.203.1645593832574; Tue, 22 Feb 2022 21:23:52 -0800 (PST) Date: Tue, 22 Feb 2022 21:21:39 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-4-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 03/47] mm: asi: Switch to unrestricted address space when entering scheduler From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org To keep things simpler, we run the scheduler only in the full unrestricted address space for the time being. Signed-off-by: Junaid Shahid --- kernel/sched/core.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 77563109c0ea..44ea197c16ea 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -19,6 +19,7 @@ #include #include +#include #include "../workqueue_internal.h" #include "../../fs/io-wq.h" @@ -6141,6 +6142,10 @@ static void __sched notrace __schedule(unsigned int sched_mode) rq = cpu_rq(cpu); prev = rq->curr; + /* This could possibly be delayed to just before the context switch. 
*/ + VM_WARN_ON(!asi_is_target_unrestricted()); + asi_exit(); + schedule_debug(prev, !!sched_mode); if (sched_feat(HRTICK) || sched_feat(HRTICK_DL)) From patchwork Wed Feb 23 05:21:40 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756410 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5D13FC433EF for ; Wed, 23 Feb 2022 05:24:04 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238127AbiBWFY3 (ORCPT ); Wed, 23 Feb 2022 00:24:29 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56510 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237957AbiBWFYW (ORCPT ); Wed, 23 Feb 2022 00:24:22 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 51CEF694AF for ; Tue, 22 Feb 2022 21:23:55 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-2d07ae11464so162250937b3.14 for ; Tue, 22 Feb 2022 21:23:55 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=kBbjpsiAT8t80DB/Hae/sSQvzOy4duVi7ttnmtZT7iU=; b=cvWzSg8yZgLUgWZKUIjW5TSN+BW3VTrtVOKpGEnHmCq9LFApxaz2ug4Y6Y4TweNu1W f9E1LiZV7paz48nIFQ58vfAnH1K6o0bvbWTXyZRNy2eoNt9EF9GseVfZcFrpOMpaAZsv YfeUwK9fKdGDUIC+bPsLXVrJ/dI3zlbx7hzpBmIG2MmUKT7no3dFhQFnuWlscR6nzDpw D94zRcBAw6obzNOLADy5DNW08Hz9FTyO1IWbrJxXtBCcNGZOBsCLR5Y7BX69Ut6ugHh9 /ICUNuePn6v0tNGb7si9lPoIGDB/04KZ4aNfI9KEiCrUUlX4abSfJJXZ09+2AJGxuoWO 1yZA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=kBbjpsiAT8t80DB/Hae/sSQvzOy4duVi7ttnmtZT7iU=; b=Ppv9iljdcW233TAKL76X4g0GNueKDUK7SkZ6rVXjlk5IylC+0q+TO8VGAE7xiX1pDE LmU409/KzLoXDw911Sv8OC5Y7VpybsDoWpxkyOyPMSYTnm9caBMSiCGVxWSF2Qbxf1KQ tXcGydVjjNkDDRAohZ9ex35VsQV87bCaibnSfH0tdmiHna1KaVQtqylfAceUKdIVMVq5 HhN1d2ursHfD0Ss32z92/+yCW4aHckFrRlDKnQ3FBV8WT+2Sa/C6RBplyKIcLFIAOFPB tj6gYrShoWM0foy0HJai/aCUl8Ho6ov3dc/s5ovbebbOv7UMUy4Z+KuB5mokTHEHkos9 GtWQ== X-Gm-Message-State: AOAM530LJQGtTahwLuCsJ2vUmogLHiNO9qLhD7l94UN2oZ/EGdMeDV7b ztRGSHKp7fnt+Ba2tPhh5Hpz9La3Iw1s X-Google-Smtp-Source: ABdhPJyBLtJwVHx/eQ2H1TXtGc5CWguV1DYKie3D2boh+nhOh7RhoskWr4B/mahF4dAK4TfkpcUJJeAOMAIF X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a05:6902:1ca:b0:624:e2a1:2856 with SMTP id u10-20020a05690201ca00b00624e2a12856mr4238491ybh.389.1645593834551; Tue, 22 Feb 2022 21:23:54 -0800 (PST) Date: Tue, 22 Feb 2022 21:21:40 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-5-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 04/47] mm: asi: ASI support in interrupts/exceptions From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, 
peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Add support for potentially switching address spaces from within interrupts/exceptions/NMIs etc. An interrupt does not automatically switch to the unrestricted address space. It can switch if needed to access some memory not available in the restricted address space, using the normal asi_exit call. On return from the outermost interrupt, if the target address space was the restricted address space (e.g. we were in the critical code path between ASI Enter and VM Enter), the restricted address space will be automatically restored. Otherwise, execution will continue in the unrestricted address space until the next explicit ASI Enter. In order to keep track of when to restore the restricted address space, an interrupt/exception nesting depth counter is maintained per-task. An alternative implementation without needing this counter is also possible, but the counter unlocks an additional nice-to-have benefit by allowing detection of whether or not we are currently executing inside an exception context, which would be useful in a later patch. Signed-off-by: Junaid Shahid --- arch/x86/include/asm/asi.h | 35 ++++++++++++++++++++++++++++++++ arch/x86/include/asm/idtentry.h | 25 +++++++++++++++++++++-- arch/x86/include/asm/processor.h | 5 +++++ arch/x86/kernel/process.c | 2 ++ arch/x86/kernel/traps.c | 2 ++ arch/x86/mm/asi.c | 3 ++- kernel/entry/common.c | 6 ++++++ 7 files changed, 75 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index 0a4af23ed0eb..7702332c62e8 100644 --- a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -4,6 +4,8 @@ #include +#include + #include #include #include @@ -51,6 +53,11 @@ void asi_destroy(struct asi *asi); void asi_enter(struct asi *asi); void asi_exit(void); +static inline void asi_init_thread_state(struct thread_struct *thread) +{ + thread->intr_nest_depth = 0; +} + static inline void asi_set_target_unrestricted(void) { if (static_cpu_has(X86_FEATURE_ASI)) { @@ -85,6 +92,34 @@ static inline bool asi_is_target_unrestricted(void) #define static_asi_enabled() cpu_feature_enabled(X86_FEATURE_ASI) +static inline void asi_intr_enter(void) +{ + if (static_cpu_has(X86_FEATURE_ASI)) { + current->thread.intr_nest_depth++; + barrier(); + } +} + +static inline void asi_intr_exit(void) +{ + void __asi_enter(void); + + if (static_cpu_has(X86_FEATURE_ASI)) { + barrier(); + + if (--current->thread.intr_nest_depth == 0) + __asi_enter(); + } +} + +#else /* CONFIG_ADDRESS_SPACE_ISOLATION */ + +static inline void asi_intr_enter(void) { } + +static inline void asi_intr_exit(void) { } + +static inline void asi_init_thread_state(struct thread_struct *thread) { } + #endif /* CONFIG_ADDRESS_SPACE_ISOLATION */ #endif diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 1345088e9902..ea5cdc90403d 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -10,6 +10,7 @@ #include #include +#include /** * DECLARE_IDTENTRY - Declare functions for simple IDT entry points @@ -133,7 +134,16 @@ static __always_inline void __##func(struct pt_regs *regs, \ * is required before the enter/exit() helpers are invoked. 
*/ #define DEFINE_IDTENTRY_RAW(func) \ -__visible noinstr void func(struct pt_regs *regs) +static __always_inline void __##func(struct pt_regs *regs); \ + \ +__visible noinstr void func(struct pt_regs *regs) \ +{ \ + asi_intr_enter(); \ + __##func (regs); \ + asi_intr_exit(); \ +} \ + \ +static __always_inline void __##func(struct pt_regs *regs) /** * DECLARE_IDTENTRY_RAW_ERRORCODE - Declare functions for raw IDT entry points @@ -161,7 +171,18 @@ __visible noinstr void func(struct pt_regs *regs) * is required before the enter/exit() helpers are invoked. */ #define DEFINE_IDTENTRY_RAW_ERRORCODE(func) \ -__visible noinstr void func(struct pt_regs *regs, unsigned long error_code) +static __always_inline void __##func(struct pt_regs *regs, \ + unsigned long error_code); \ + \ +__visible noinstr void func(struct pt_regs *regs, unsigned long error_code)\ +{ \ + asi_intr_enter(); \ + __##func (regs, error_code); \ + asi_intr_exit(); \ +} \ + \ +static __always_inline void __##func(struct pt_regs *regs, \ + unsigned long error_code) /** * DECLARE_IDTENTRY_IRQ - Declare functions for device interrupt IDT entry diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 355d38c0cf60..20116efd2756 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -519,6 +519,11 @@ struct thread_struct { unsigned int iopl_warn:1; unsigned int sig_on_uaccess_err:1; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + /* The nesting depth of exceptions/interrupts */ + int intr_nest_depth; +#endif + /* * Protection Keys Register for Userspace. Loaded immediately on * context switch. Store it in thread_struct to avoid a lookup in diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 04143a653a8a..c8d4a00a4de7 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -90,6 +90,8 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src) #ifdef CONFIG_VM86 dst->thread.vm86 = NULL; #endif + asi_init_thread_state(&dst->thread); + /* Drop the copied pointer to current's fpstate */ dst->thread.fpu.fpstate = NULL; diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index c9d566dcf89a..acf675ddda96 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -61,6 +61,7 @@ #include #include #include +#include #ifdef CONFIG_X86_64 #include @@ -413,6 +414,7 @@ DEFINE_IDTENTRY_DF(exc_double_fault) } #endif + asi_exit(); irqentry_nmi_enter(regs); instrumentation_begin(); notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV); diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index d274c86f89b7..2453124f221d 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -107,12 +107,13 @@ void asi_destroy(struct asi *asi) } EXPORT_SYMBOL_GPL(asi_destroy); -static void __asi_enter(void) +void __asi_enter(void) { u64 asi_cr3; struct asi *target = this_cpu_read(asi_cpu_state.target_asi); VM_BUG_ON(preemptible()); + VM_BUG_ON(current->thread.intr_nest_depth != 0); if (!target || target == this_cpu_read(asi_cpu_state.curr_asi)) return; diff --git a/kernel/entry/common.c b/kernel/entry/common.c index d5a61d565ad5..9064253085c7 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -9,6 +9,8 @@ #include "common.h" +#include + #define CREATE_TRACE_POINTS #include @@ -321,6 +323,8 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs) .exit_rcu = false, }; + asi_intr_enter(); + if (user_mode(regs)) { irqentry_enter_from_user_mode(regs); return ret; @@ -416,6 +420,7 @@ noinstr void 
irqentry_exit(struct pt_regs *regs, irqentry_state_t state) instrumentation_end(); rcu_irq_exit(); lockdep_hardirqs_on(CALLER_ADDR0); + asi_intr_exit(); return; } @@ -438,6 +443,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state) if (state.exit_rcu) rcu_irq_exit(); } + asi_intr_exit(); } irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs) From patchwork Wed Feb 23 05:21:41 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756411 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 84080C433F5 for ; Wed, 23 Feb 2022 05:24:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238054AbiBWFYa (ORCPT ); Wed, 23 Feb 2022 00:24:30 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56600 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238008AbiBWFYY (ORCPT ); Wed, 23 Feb 2022 00:24:24 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9510569CF9 for ; Tue, 22 Feb 2022 21:23:57 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id m10-20020a25800a000000b0061daa5b7151so26448948ybk.10 for ; Tue, 22 Feb 2022 21:23:57 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=LI1Qxw8YovJHKPYFPYoCf4datmDsbayp3z46DL8Kp8Q=; b=U7xDPHt7nJyB2GTAgVc+5mVNa5D9AiEIkeTSJpYyuShkSe5G+Xo37Cmu4z2S0ODbd3 Wm5Eeq8cvy6TNvAv3A7bWuNv++i7NTg2KMLHM/m4eHjfRoQqwVJ2y/Yi5MDF7xrDbsQI j5wRAvkCT4cYk3Bc/Ve5H0pU1FD7r6VuCExamGfvXcE67Mlg9W1l1RFlwTf48DP0g9j8 zrQbvmaWMqoIe6J4oo0xoCcquiVzfC7OAHWIqYk7V7cZwcAdfqwpYM8UaM/ZPk1iYlJM o8gS8/KkwFxTjJeXBM9VLMqxRp+ccKcT56QtTvv+HXupkPEETHBW4Z+e7zRt00MQFKUT ixgA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=LI1Qxw8YovJHKPYFPYoCf4datmDsbayp3z46DL8Kp8Q=; b=FbK5yWGvXJYtq9EkuH9FvfR6Dv76h00fM/cKLszRybhOYT4uiz40zVApmxQ7wUsPLj JQt6IkPVxL4rW5i9pDTa+rmA/0yNmcIRXYuZ9odP9ozvfJ8AFxXj4UDbpuZVK61cjVoq mPEMO/Oj5JO8n3hRZADUmw254+rloG/zznVo9IjmuY9XHozjcJMd7mi31dejTj4TDyIN T6r4wt0LsmgSS8yG7DY162b0D+4IW5stnML2t7q3hdVDT2ar7GCxPl9KywlqBK19LvaV Z31xFBJx3Ldci73nZSRy4zsgB7diSq8Tj4pb1OJpznptT2naM9V9FkcoKe4IqRHNqWaU fQDg== X-Gm-Message-State: AOAM533VqD5KFtXmiWQYsj85NPf5bK1tylR+iWNUMI8MllOm2Jvwl/fe Qv+XvkvrtBx/SHJE359PWYd4iQNTD60R X-Google-Smtp-Source: ABdhPJxfznelYhWgkiOo62fJh4kZWSbldwDVUL5AZ4s809e7D3FsyXFvg8XSNCQZyLZ5Vxg+VGGrfUoGnpy5 X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a81:5cc3:0:b0:2d0:a2d0:9c0e with SMTP id q186-20020a815cc3000000b002d0a2d09c0emr27666033ywb.270.1645593836836; Tue, 22 Feb 2022 21:23:56 -0800 (PST) Date: Tue, 22 Feb 2022 21:21:41 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-6-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 05/47] mm: asi: Make __get_current_cr3_fast() ASI-aware From: 
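To illustrate the interrupt/exception model described above: a handler that needs memory which is not mapped in the restricted address space simply calls asi_exit() and relies on the nesting-depth bookkeeping to restore the restricted mapping on return. A hypothetical sketch (access_sensitive_data() is a stand-in, not a real function):

#include <asm/asi.h>

/* Stand-in for any access that would fault in the restricted address space. */
static void access_sensitive_data(void)
{
}

static void example_irq_work(void)
{
	/*
	 * We may have been interrupted while a restricted address space was
	 * loaded. Data that is not mapped there can only be touched after
	 * dropping back to the full kernel address space.
	 */
	asi_exit();
	access_sensitive_data();

	/*
	 * No explicit re-enter is needed here: when the outermost interrupt
	 * returns, asi_intr_exit() sees the nesting depth reach zero and
	 * calls __asi_enter(), which reloads the restricted address space
	 * if it is still the per-CPU target.
	 */
}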
Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org When ASI is active, __get_current_cr3_fast() adjusts the returned CR3 value accordingly to reflect the actual ASI CR3. Signed-off-by: Junaid Shahid --- arch/x86/include/asm/asi.h | 7 +++++++ arch/x86/mm/tlb.c | 20 ++++++++++++++++++-- 2 files changed, 25 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index 7702332c62e8..95557211dabd 100644 --- a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -112,6 +112,11 @@ static inline void asi_intr_exit(void) } } +static inline pgd_t *asi_pgd(struct asi *asi) +{ + return asi->pgd; +} + #else /* CONFIG_ADDRESS_SPACE_ISOLATION */ static inline void asi_intr_enter(void) { } @@ -120,6 +125,8 @@ static inline void asi_intr_exit(void) { } static inline void asi_init_thread_state(struct thread_struct *thread) { } +static inline pgd_t *asi_pgd(struct asi *asi) { return NULL; } + #endif /* CONFIG_ADDRESS_SPACE_ISOLATION */ #endif diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 88d9298720dc..25bee959d1d3 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -17,6 +17,7 @@ #include #include #include +#include #include "mm_internal.h" @@ -1073,12 +1074,27 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end) */ unsigned long __get_current_cr3_fast(void) { - unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd, - this_cpu_read(cpu_tlbstate.loaded_mm_asid)); + unsigned long cr3; + pgd_t *pgd; + u16 asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid); + struct asi *asi = asi_get_current(); + + if (asi) + pgd = asi_pgd(asi); + else + pgd = this_cpu_read(cpu_tlbstate.loaded_mm)->pgd; + + cr3 = build_cr3(pgd, asid); /* For now, be very restrictive about when this can be called. */ VM_WARN_ON(in_nmi() || preemptible()); + /* + * CR3 is unstable if the target ASI is unrestricted + * and a restricted ASI is currently loaded. 
+ */ + VM_WARN_ON_ONCE(asi && asi_is_target_unrestricted()); + VM_BUG_ON(cr3 != __read_cr3()); return cr3; } From patchwork Wed Feb 23 05:21:42 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756412 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 052F7C433FE for ; Wed, 23 Feb 2022 05:24:07 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238098AbiBWFYb (ORCPT ); Wed, 23 Feb 2022 00:24:31 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56726 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237962AbiBWFY2 (ORCPT ); Wed, 23 Feb 2022 00:24:28 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BF0CB69CE6 for ; Tue, 22 Feb 2022 21:23:59 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-2d726bd83a2so91076827b3.20 for ; Tue, 22 Feb 2022 21:23:59 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=PIzleJhvx1e4in5SU2cz59acEkt7I4gZXxP/VePrhcs=; b=UOhpXDHCnae+DRnauRxZtdjiTln+VoGZhoV1LC2HwZ0wE6hCMEDx6ZsM5m4zhmzU8C zHAJJxzPwGGULan8LLoI7C3OSmGzlkgZIkeoNkliXHKxR2zxRtRuDwE2nmiSS1c4BXRL MiuUyWSM6+TUYE5RqqQekigHUWwoMR7es7zfFE2KzZ5ooe/XoTE9TQgA48klkJO1uSey E656AcQf7rLEbAA6uzuZiKaqjksL8iwXpjTOMvdQBp3iyrK4xwvnjKTRQ5lhnl0vyEeu pioGnqi8o2SeqNRitqaZjawpEPOB4cWAR7yNk2BDoyC4/ldrIIsdHyj3n/nSEdlK5M0a gypQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=PIzleJhvx1e4in5SU2cz59acEkt7I4gZXxP/VePrhcs=; b=ID8nqAcYzMMYYKukHKcg9FBPRFKAZrcChpm+KV87rtbhCvofobfqmJQ8U78M4YNUJf WHarkUI9ij74/y3/h65AgDU7vNQBV3bO6XJ4Vbn+/EZqydjo2WmfAsNXrlrzt0+wwLng GNhkAeudWNSDYsZiA5Qni2g+wQC0cHuFu+W1U/TmH5ehKNmOzAqHr6ej4L1TwaxWBTXg MtO8KmW6hOGV0SrPs2gffvnrUPzLpvo/9HBWio+YgfSOkbwYZVilPFRYkCQy3ftkmSU8 sPr68XryYB9QcIHsOLPVBNEse85y0iF0N59Fz9XQrFGn2Qr5l2TVPQOQUTVDBOrGxyxw CrVg== X-Gm-Message-State: AOAM531ANnALm9LU9OdrAE/04RU0def1nV1CZC27qdpa2L3ScA6MfJwe 03LBLdOppyJYlKiabDmehSGXkk4zBmJ0 X-Google-Smtp-Source: ABdhPJyZtnc6SgOpidUjkYpln4RLDN4W6lG8iW37Mn66M+p2QX3sw1OB0/Qqkeh8xQlAuLXN36L1X25C92Rt X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a0d:c607:0:b0:2ca:287c:6b6c with SMTP id i7-20020a0dc607000000b002ca287c6b6cmr28060793ywd.17.1645593839000; Tue, 22 Feb 2022 21:23:59 -0800 (PST) Date: Tue, 22 Feb 2022 21:21:42 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-7-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 06/47] mm: asi: ASI page table allocation and free functions From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, 
luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org This adds custom allocation and free functions for ASI page tables. The alloc functions support allocating memory using different GFP reclaim flags, in order to be able to support non-sensitive allocations from both standard and atomic contexts. They also install the page tables locklessly, which makes it slightly simpler to handle non-sensitive allocations from interrupts/exceptions. The free functions recursively free the page tables when the ASI instance is being torn down. Signed-off-by: Junaid Shahid --- arch/x86/mm/asi.c | 109 +++++++++++++++++++++++++++++++++++++++- include/linux/pgtable.h | 3 ++ 2 files changed, 111 insertions(+), 1 deletion(-) diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index 2453124f221d..40d772b2e2a8 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -60,6 +60,113 @@ void asi_unregister_class(int index) } EXPORT_SYMBOL_GPL(asi_unregister_class); +#ifndef mm_inc_nr_p4ds +#define mm_inc_nr_p4ds(mm) do {} while (false) +#endif + +#ifndef mm_dec_nr_p4ds +#define mm_dec_nr_p4ds(mm) do {} while (false) +#endif + +#define pte_offset pte_offset_kernel + +#define DEFINE_ASI_PGTBL_ALLOC(base, level) \ +static level##_t * asi_##level##_alloc(struct asi *asi, \ + base##_t *base, ulong addr, \ + gfp_t flags) \ +{ \ + if (unlikely(base##_none(*base))) { \ + ulong pgtbl = get_zeroed_page(flags); \ + phys_addr_t pgtbl_pa; \ + \ + if (pgtbl == 0) \ + return NULL; \ + \ + pgtbl_pa = __pa(pgtbl); \ + paravirt_alloc_##level(asi->mm, PHYS_PFN(pgtbl_pa)); \ + \ + if (cmpxchg((ulong *)base, 0, \ + pgtbl_pa | _PAGE_TABLE) == 0) { \ + mm_inc_nr_##level##s(asi->mm); \ + } else { \ + paravirt_release_##level(PHYS_PFN(pgtbl_pa)); \ + free_page(pgtbl); \ + } \ + \ + /* NOP on native. PV call on Xen. */ \ + set_##base(base, *base); \ + } \ + VM_BUG_ON(base##_large(*base)); \ + return level##_offset(base, addr); \ +} + +DEFINE_ASI_PGTBL_ALLOC(pgd, p4d) +DEFINE_ASI_PGTBL_ALLOC(p4d, pud) +DEFINE_ASI_PGTBL_ALLOC(pud, pmd) +DEFINE_ASI_PGTBL_ALLOC(pmd, pte) + +#define asi_free_dummy(asi, addr) +#define __pmd_free(mm, pmd) free_page((ulong)(pmd)) +#define pud_page_vaddr(pud) ((ulong)pud_pgtable(pud)) +#define p4d_page_vaddr(p4d) ((ulong)p4d_pgtable(p4d)) + +static inline unsigned long pte_page_vaddr(pte_t pte) +{ + return (unsigned long)__va(pte_val(pte) & PTE_PFN_MASK); +} + +#define DEFINE_ASI_PGTBL_FREE(level, LEVEL, next, free) \ +static void asi_free_##level(struct asi *asi, ulong pgtbl_addr) \ +{ \ + uint i; \ + level##_t *level = (level##_t *)pgtbl_addr; \ + \ + for (i = 0; i < PTRS_PER_##LEVEL; i++) { \ + ulong vaddr; \ + \ + if (level##_none(level[i])) \ + continue; \ + \ + vaddr = level##_page_vaddr(level[i]); \ + \ + if (!level##_leaf(level[i])) \ + asi_free_##next(asi, vaddr); \ + else \ + VM_WARN(true, "Lingering mapping in ASI %p at %lx",\ + asi, vaddr); \ + } \ + paravirt_release_##level(PHYS_PFN(__pa(pgtbl_addr))); \ + free(asi->mm, level); \ + mm_dec_nr_##level##s(asi->mm); \ +} + +DEFINE_ASI_PGTBL_FREE(pte, PTE, dummy, pte_free_kernel) +DEFINE_ASI_PGTBL_FREE(pmd, PMD, pte, __pmd_free) +DEFINE_ASI_PGTBL_FREE(pud, PUD, pmd, pud_free) +DEFINE_ASI_PGTBL_FREE(p4d, P4D, pud, p4d_free) + +static void asi_free_pgd_range(struct asi *asi, uint start, uint end) +{ + uint i; + + for (i = start; i < end; i++) + if (pgd_present(asi->pgd[i])) + asi_free_p4d(asi, (ulong)p4d_offset(asi->pgd + i, 0)); +} + +/* + * Free the page tables allocated for the given ASI instance. 
+ * The caller must ensure that all the mappings have already been cleared + * and appropriate TLB flushes have been issued before calling this function. + */ +static void asi_free_pgd(struct asi *asi) +{ + VM_BUG_ON(asi->mm == &init_mm); + + asi_free_pgd_range(asi, KERNEL_PGD_BOUNDARY, PTRS_PER_PGD); + free_pages((ulong)asi->pgd, PGD_ALLOCATION_ORDER); +} + static int __init set_asi_param(char *str) { if (strcmp(str, "on") == 0) @@ -102,7 +209,7 @@ void asi_destroy(struct asi *asi) if (!boot_cpu_has(X86_FEATURE_ASI)) return; - free_pages((ulong)asi->pgd, PGD_ALLOCATION_ORDER); + asi_free_pgd(asi); memset(asi, 0, sizeof(struct asi)); } EXPORT_SYMBOL_GPL(asi_destroy); diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e24d2c992b11..2fff17a939f0 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1593,6 +1593,9 @@ typedef unsigned int pgtbl_mod_mask; #ifndef pmd_leaf #define pmd_leaf(x) 0 #endif +#ifndef pte_leaf +#define pte_leaf(x) 1 +#endif #ifndef pgd_leaf_size #define pgd_leaf_size(x) (1ULL << PGDIR_SHIFT) From patchwork Wed Feb 23 05:21:43 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756413 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0174DC433F5 for ; Wed, 23 Feb 2022 05:24:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238125AbiBWFYf (ORCPT ); Wed, 23 Feb 2022 00:24:35 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56854 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238115AbiBWFYb (ORCPT ); Wed, 23 Feb 2022 00:24:31 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0B2D669CC1 for ; Tue, 22 Feb 2022 21:24:01 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id o5-20020a25d705000000b0062499d760easo8076385ybg.7 for ; Tue, 22 Feb 2022 21:24:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=4OZ+JPmRPuew/8tnwetStHaUs2Nh17IXa7mxxiAMwj8=; b=kb4u7VjrnfodFdsXodDcDOsogb6ps0VFuoU2Ch/2lXixZ/HfftZw3kdktGLZNqJaaS sS3He2L5FPw70IJb8AKJo1NDhJWYKpPBXq/OhYV1okDxsl0mvsUmyEav/HUJeog/Gj9y Q5Dc8pVb07blL49n409GgYRDrcKVQVQRVFNARfm5uMsXWp3Yd02cr7kkTrGmuAR1Xu1E sxdm0LFpj5VhSie+6tVTejQjNAf6nTCGVLsZ8GtzBSgzjYfKnMkFSPkXEgjwcV5k4Zwo pwZnNuNgZRjJww5tQ8PUPFXrFLOBqqQa1ttWIBXg8Sv/mjjUaM/rS2VlzHPfgRo69Xqa n3Jg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=4OZ+JPmRPuew/8tnwetStHaUs2Nh17IXa7mxxiAMwj8=; b=23AelsSjI8ui78l1GbLLtK+UbWdVR53c6LlEyur8JeOoS8IIRFjFiiZB7QkEKp36jb reV5B9SFUKXWO51lB9dCQfTUTcmMMipoI2BqYrnZd8aoOb7ScY0Uz3SKT0jJy0r1Ury9 5XA/X/QCG6GFUwEHMcl+2T31LOoF86UjAv+nJ12rI791ctIH/KML3W+xEemYup6P57Bg 4l2Z3LgGC2yrVEqMsKIK2Ur36wVto64Af1+pTRCW1VaY1GeoldNIfVxCEKu58VaNxmzs r9EiRM31Ka7vfK8vZ7kcl9s3016kUmoT1SXWnzAiToLLPhIUf38t9oLMm9FIgv/Mhn0M 65gQ== X-Gm-Message-State: AOAM530g+6ph0uIUrHNgFBI/CTB8/DP3PGz1UD/Y6+JV1HSXq1zOLRMx SLYbugiG8tQ784mQ7Pxg14GuhJ9KIB2g X-Google-Smtp-Source: 
ABdhPJznarXNTwMlwRo1R3Ew9ob69h2spGBZggFbqvzN1k2HjtUx3qHkoNOb8UlGcjxUj887oMmqSttakWi9 X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a0d:eb09:0:b0:2d1:e0df:5104 with SMTP id u9-20020a0deb09000000b002d1e0df5104mr27667696ywe.250.1645593841036; Tue, 22 Feb 2022 21:24:01 -0800 (PST) Date: Tue, 22 Feb 2022 21:21:43 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-8-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 07/47] mm: asi: Functions to map/unmap a memory range into ASI page tables From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Two functions, asi_map() and asi_map_gfp(), are added to allow mapping memory into ASI page tables. The mapping will be identical to the one for the same virtual address in the unrestricted page tables. This is necessary to allow switching between the page tables at any arbitrary point in the kernel. Another function, asi_unmap() is added to allow unmapping memory mapped via asi_map* Signed-off-by: Junaid Shahid --- arch/x86/include/asm/asi.h | 5 + arch/x86/mm/asi.c | 196 +++++++++++++++++++++++++++++++++++++ include/asm-generic/asi.h | 19 ++++ mm/internal.h | 3 + mm/vmalloc.c | 60 +++++++----- 5 files changed, 261 insertions(+), 22 deletions(-) diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index 95557211dabd..521b40d1864b 100644 --- a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -53,6 +53,11 @@ void asi_destroy(struct asi *asi); void asi_enter(struct asi *asi); void asi_exit(void); +int asi_map_gfp(struct asi *asi, void *addr, size_t len, gfp_t gfp_flags); +int asi_map(struct asi *asi, void *addr, size_t len); +void asi_unmap(struct asi *asi, void *addr, size_t len, bool flush_tlb); +void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len); + static inline void asi_init_thread_state(struct thread_struct *thread) { thread->intr_nest_depth = 0; diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index 40d772b2e2a8..84d220cbdcfc 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -6,6 +6,8 @@ #include #include +#include "../../../mm/internal.h" + #undef pr_fmt #define pr_fmt(fmt) "ASI: " fmt @@ -287,3 +289,197 @@ void asi_init_mm_state(struct mm_struct *mm) { memset(mm->asi, 0, sizeof(mm->asi)); } + +static bool is_page_within_range(size_t addr, size_t page_size, + size_t range_start, size_t range_end) +{ + size_t page_start, page_end, page_mask; + + page_mask = ~(page_size - 1); + page_start = addr & page_mask; + page_end = page_start + page_size; + + return page_start >= range_start && page_end <= range_end; +} + +static bool follow_physaddr(struct mm_struct *mm, size_t virt, + phys_addr_t *phys, size_t *page_size, ulong *flags) +{ + pgd_t *pgd; + p4d_t *p4d; + pud_t *pud; + pmd_t *pmd; + pte_t *pte; + +#define follow_addr_at_level(base, level, LEVEL) \ + do { \ + *page_size = LEVEL##_SIZE; \ + level = level##_offset(base, virt); \ + if (!level##_present(*level)) \ + return false; \ + \ + if (level##_large(*level)) { \ + *phys = 
PFN_PHYS(level##_pfn(*level)) | \ + (virt & ~LEVEL##_MASK); \ + *flags = level##_flags(*level); \ + return true; \ + } \ + } while (false) + + follow_addr_at_level(mm, pgd, PGDIR); + follow_addr_at_level(pgd, p4d, P4D); + follow_addr_at_level(p4d, pud, PUD); + follow_addr_at_level(pud, pmd, PMD); + + *page_size = PAGE_SIZE; + pte = pte_offset_map(pmd, virt); + if (!pte) + return false; + + if (!pte_present(*pte)) { + pte_unmap(pte); + return false; + } + + *phys = PFN_PHYS(pte_pfn(*pte)) | (virt & ~PAGE_MASK); + *flags = pte_flags(*pte); + + pte_unmap(pte); + return true; + +#undef follow_addr_at_level +} + +/* + * Map the given range into the ASI page tables. The source of the mapping + * is the regular unrestricted page tables. + * Can be used to map any kernel memory. + * + * The caller MUST ensure that the source mapping will not change during this + * function. For dynamic kernel memory, this is generally ensured by mapping + * the memory within the allocator. + * + * If the source mapping is a large page and the range being mapped spans the + * entire large page, then it will be mapped as a large page in the ASI page + * tables too. If the range does not span the entire huge page, then it will + * be mapped as smaller pages. In that case, the implementation is slightly + * inefficient, as it will walk the source page tables again for each small + * destination page, but that should be ok for now, as usually in such cases, + * the range would consist of a small-ish number of pages. + */ +int asi_map_gfp(struct asi *asi, void *addr, size_t len, gfp_t gfp_flags) +{ + size_t virt; + size_t start = (size_t)addr; + size_t end = start + len; + size_t page_size; + + if (!static_cpu_has(X86_FEATURE_ASI)) + return 0; + + VM_BUG_ON(start & ~PAGE_MASK); + VM_BUG_ON(len & ~PAGE_MASK); + VM_BUG_ON(start < TASK_SIZE_MAX); + + gfp_flags &= GFP_RECLAIM_MASK; + + if (asi->mm != &init_mm) + gfp_flags |= __GFP_ACCOUNT; + + for (virt = start; virt < end; virt = ALIGN(virt + 1, page_size)) { + pgd_t *pgd; + p4d_t *p4d; + pud_t *pud; + pmd_t *pmd; + pte_t *pte; + phys_addr_t phys; + ulong flags; + + if (!follow_physaddr(asi->mm, virt, &phys, &page_size, &flags)) + continue; + +#define MAP_AT_LEVEL(base, BASE, level, LEVEL) { \ + if (base##_large(*base)) { \ + VM_BUG_ON(PHYS_PFN(phys & BASE##_MASK) != \ + base##_pfn(*base)); \ + continue; \ + } \ + \ + level = asi_##level##_alloc(asi, base, virt, gfp_flags);\ + if (!level) \ + return -ENOMEM; \ + \ + if (page_size >= LEVEL##_SIZE && \ + (level##_none(*level) || level##_leaf(*level)) && \ + is_page_within_range(virt, LEVEL##_SIZE, \ + start, end)) { \ + page_size = LEVEL##_SIZE; \ + phys &= LEVEL##_MASK; \ + \ + if (level##_none(*level)) \ + set_##level(level, \ + __##level(phys | flags)); \ + else \ + VM_BUG_ON(level##_pfn(*level) != \ + PHYS_PFN(phys)); \ + continue; \ + } \ + } + + pgd = pgd_offset_pgd(asi->pgd, virt); + + MAP_AT_LEVEL(pgd, PGDIR, p4d, P4D); + MAP_AT_LEVEL(p4d, P4D, pud, PUD); + MAP_AT_LEVEL(pud, PUD, pmd, PMD); + MAP_AT_LEVEL(pmd, PMD, pte, PAGE); + + VM_BUG_ON(true); /* Should never reach here. */ +#undef MAP_AT_LEVEL + } + + return 0; +} + +int asi_map(struct asi *asi, void *addr, size_t len) +{ + return asi_map_gfp(asi, addr, len, GFP_KERNEL); +} + +/* + * Unmap a kernel address range previously mapped into the ASI page tables. + * The caller must ensure appropriate TLB flushing. 
+ * + * The area being unmapped must be a whole previously mapped region (or regions) + * Unmapping a partial subset of a previously mapped region is not supported. + * That will work, but may end up unmapping more than what was asked for, if + * the mapping contained huge pages. + * + * Note that higher order direct map allocations are allowed to be partially + * freed. If it turns out that that actually happens for any of the + * non-sensitive allocations, then the above limitation may be a problem. For + * now, vunmap_pgd_range() will emit a warning if this situation is detected. + */ +void asi_unmap(struct asi *asi, void *addr, size_t len, bool flush_tlb) +{ + size_t start = (size_t)addr; + size_t end = start + len; + pgtbl_mod_mask mask = 0; + + if (!static_cpu_has(X86_FEATURE_ASI) || !len) + return; + + VM_BUG_ON(start & ~PAGE_MASK); + VM_BUG_ON(len & ~PAGE_MASK); + VM_BUG_ON(start < TASK_SIZE_MAX); + + vunmap_pgd_range(asi->pgd, start, end, &mask, false); + + if (flush_tlb) + asi_flush_tlb_range(asi, addr, len); +} + +void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len) +{ + /* Later patches will do a more optimized flush. */ + flush_tlb_kernel_range((ulong)addr, (ulong)addr + len); +} diff --git a/include/asm-generic/asi.h b/include/asm-generic/asi.h index dae1403ee1d0..7da91cbe075d 100644 --- a/include/asm-generic/asi.h +++ b/include/asm-generic/asi.h @@ -2,6 +2,8 @@ #ifndef __ASM_GENERIC_ASI_H #define __ASM_GENERIC_ASI_H +#include + /* ASI class flags */ #define ASI_MAP_STANDARD_NONSENSITIVE 1 @@ -44,6 +46,23 @@ static inline struct asi *asi_get_target(void) { return NULL; } static inline struct asi *asi_get_current(void) { return NULL; } +static inline +int asi_map_gfp(struct asi *asi, void *addr, size_t len, gfp_t gfp_flags) +{ + return 0; +} + +static inline int asi_map(struct asi *asi, void *addr, size_t len) +{ + return 0; +} + +static inline +void asi_unmap(struct asi *asi, void *addr, size_t len, bool flush_tlb) { } + +static inline +void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len) { } + #define static_asi_enabled() false #endif /* !_ASSEMBLY_ */ diff --git a/mm/internal.h b/mm/internal.h index 3b79a5c9427a..ae8799d86dd3 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -79,6 +79,9 @@ void unmap_page_range(struct mmu_gather *tlb, unsigned long addr, unsigned long end, struct zap_details *details); +void vunmap_pgd_range(pgd_t *pgd_table, unsigned long addr, unsigned long end, + pgtbl_mod_mask *mask, bool sleepable); + void do_page_cache_ra(struct readahead_control *, unsigned long nr_to_read, unsigned long lookahead_size); void force_page_cache_ra(struct readahead_control *, unsigned long nr); diff --git a/mm/vmalloc.c b/mm/vmalloc.c index d2a00ad4e1dd..f2ef719f1cba 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -336,7 +336,7 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, } static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, - pgtbl_mod_mask *mask) + pgtbl_mod_mask *mask, bool sleepable) { pmd_t *pmd; unsigned long next; @@ -350,18 +350,22 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, if (cleared || pmd_bad(*pmd)) *mask |= PGTBL_PMD_MODIFIED; - if (cleared) + if (cleared) { + WARN_ON(addr & ~PMD_MASK); + WARN_ON(next & ~PMD_MASK); continue; + } if (pmd_none_or_clear_bad(pmd)) continue; vunmap_pte_range(pmd, addr, next, mask); - cond_resched(); + if (sleepable) + cond_resched(); } while (pmd++, addr = next, addr != end); } static void 
vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end, - pgtbl_mod_mask *mask) + pgtbl_mod_mask *mask, bool sleepable) { pud_t *pud; unsigned long next; @@ -375,16 +379,19 @@ static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end, if (cleared || pud_bad(*pud)) *mask |= PGTBL_PUD_MODIFIED; - if (cleared) + if (cleared) { + WARN_ON(addr & ~PUD_MASK); + WARN_ON(next & ~PUD_MASK); continue; + } if (pud_none_or_clear_bad(pud)) continue; - vunmap_pmd_range(pud, addr, next, mask); + vunmap_pmd_range(pud, addr, next, mask, sleepable); } while (pud++, addr = next, addr != end); } static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end, - pgtbl_mod_mask *mask) + pgtbl_mod_mask *mask, bool sleepable) { p4d_t *p4d; unsigned long next; @@ -398,14 +405,35 @@ static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end, if (cleared || p4d_bad(*p4d)) *mask |= PGTBL_P4D_MODIFIED; - if (cleared) + if (cleared) { + WARN_ON(addr & ~P4D_MASK); + WARN_ON(next & ~P4D_MASK); continue; + } if (p4d_none_or_clear_bad(p4d)) continue; - vunmap_pud_range(p4d, addr, next, mask); + vunmap_pud_range(p4d, addr, next, mask, sleepable); } while (p4d++, addr = next, addr != end); } +void vunmap_pgd_range(pgd_t *pgd_table, unsigned long addr, unsigned long end, + pgtbl_mod_mask *mask, bool sleepable) +{ + unsigned long next; + pgd_t *pgd = pgd_offset_pgd(pgd_table, addr); + + BUG_ON(addr >= end); + + do { + next = pgd_addr_end(addr, end); + if (pgd_bad(*pgd)) + *mask |= PGTBL_PGD_MODIFIED; + if (pgd_none_or_clear_bad(pgd)) + continue; + vunmap_p4d_range(pgd, addr, next, mask, sleepable); + } while (pgd++, addr = next, addr != end); +} + /* * vunmap_range_noflush is similar to vunmap_range, but does not * flush caches or TLBs. 
@@ -420,21 +448,9 @@ static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end, */ void vunmap_range_noflush(unsigned long start, unsigned long end) { - unsigned long next; - pgd_t *pgd; - unsigned long addr = start; pgtbl_mod_mask mask = 0; - BUG_ON(addr >= end); - pgd = pgd_offset_k(addr); - do { - next = pgd_addr_end(addr, end); - if (pgd_bad(*pgd)) - mask |= PGTBL_PGD_MODIFIED; - if (pgd_none_or_clear_bad(pgd)) - continue; - vunmap_p4d_range(pgd, addr, next, &mask); - } while (pgd++, addr = next, addr != end); + vunmap_pgd_range(init_mm.pgd, start, end, &mask, true); if (mask & ARCH_PAGE_TABLE_SYNC_MASK) arch_sync_kernel_mappings(start, end); From patchwork Wed Feb 23 05:21:44 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756414 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0E897C433EF for ; Wed, 23 Feb 2022 05:24:31 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238211AbiBWFYz (ORCPT ); Wed, 23 Feb 2022 00:24:55 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57628 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237674AbiBWFYw (ORCPT ); Wed, 23 Feb 2022 00:24:52 -0500 Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3C9596A033 for ; Tue, 22 Feb 2022 21:24:04 -0800 (PST) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-2d07ae11460so162180387b3.7 for ; Tue, 22 Feb 2022 21:24:04 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=rf+PEwSCAjx96OEampym4lZ49QLmyoHK2vvPo63tKDU=; b=ClDNkRcR05xwUIt+/3suGkjgzNs7jx8crmYYSBy7YIfyMwXpzxpsuO7L5+YX8vxRfh 1E4PSbB3uEBG+XkPHeMZghVckls3s0XgENlVkTaPWZjkwqbjhFcBMXRc0pNft0u4lgb5 OHkuB86fnETxs+Q/fn9lSwDJhVOhXOzfA1+wbVB6gzJyDg9tXEAFaT4CQ0u+HBjQvZbK xaLThn55HGR54AO8ID3Zi3c8s3jxcmEzd75ic5aZDYq6ErsSaIKBY/8EVr+KDDNXwZv5 bVHGd/T3R6cR1QrhXBnqEawoY6M01XywGeiIoQ1z1l9A03h+cXgMumLSGriK6mCj3H15 pxlQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=rf+PEwSCAjx96OEampym4lZ49QLmyoHK2vvPo63tKDU=; b=TWS0cCz8A+OhmZD2/8xVhmuCL/FEZejQzsc1AF0KCWPxPOVzH9OS1BZAsIbpvWi0e5 UbTsr593hCyMj/KsI9eN/kuA4e5V5+IKbSQdid32i4iEoUPfN/n0IF+oI6AdazYiJBcE O5CInHSyYM6ztTsjh+88D61ic5G8OF+HUwHyyizZ6E+6upwFJHYh/zosKLuWmV85xLhI 6X1LKTM/S7AXvhLvzXWAvky15cbXahmQ+63AwAWxCtuu8VjXSNSH3jMyfENHdZfxjbGW tQEmOLuJHhD6VQEN1n0JHgxiJ62pU0VX8/s1MkZPWze9JYWhhbgnaZwRReZVj94kphCz SolA== X-Gm-Message-State: AOAM531jNedDZQr7BiqWSBZHAKfwFo9WJRibuxgful0ShgB4YaZnXR6o unW10KfnndNfZD0EOF61QsA1OyROpj4l X-Google-Smtp-Source: ABdhPJxAE8giXqGIDJh7Mviar4JsaaTJZtQQuDBTZF65UTbW8rTNqxt0+PJdM/jBytNMKHkW6V66FJSXIQ6n X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a05:6902:108:b0:621:165e:5c1e with SMTP id o8-20020a056902010800b00621165e5c1emr25436069ybh.204.1645593843385; Tue, 22 Feb 2022 21:24:03 -0800 (PST) Date: Tue, 22 Feb 2022 21:21:44 -0800 In-Reply-To: 
<20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-9-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 08/47] mm: asi: Add basic infrastructure for global non-sensitive mappings From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org A pseudo-PGD is added to store global non-sensitive ASI mappings. Actual ASI PGDs copy entries from this pseudo-PGD during asi_init(). Memory can be mapped as globally non-sensitive by calling asi_map() with ASI_GLOBAL_NONSENSITIVE. Page tables allocated for global non-sensitive mappings are never freed. Signed-off-by: Junaid Shahid --- arch/x86/include/asm/asi.h | 12 ++++++++++++ arch/x86/mm/asi.c | 36 +++++++++++++++++++++++++++++++++++- arch/x86/mm/init_64.c | 26 +++++++++++++++++--------- arch/x86/mm/mm_internal.h | 3 +++ include/asm-generic/asi.h | 5 +++++ mm/init-mm.c | 2 ++ 6 files changed, 74 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index 521b40d1864b..64c2b4d1dba2 100644 --- a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -15,6 +15,8 @@ #define ASI_MAX_NUM_ORDER 2 #define ASI_MAX_NUM (1 << ASI_MAX_NUM_ORDER) +#define ASI_GLOBAL_NONSENSITIVE (&init_mm.asi[0]) + struct asi_state { struct asi *curr_asi; struct asi *target_asi; @@ -41,6 +43,8 @@ struct asi { DECLARE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state); +extern pgd_t asi_global_nonsensitive_pgd[]; + void asi_init_mm_state(struct mm_struct *mm); int asi_register_class(const char *name, uint flags, @@ -117,6 +121,14 @@ static inline void asi_intr_exit(void) } } +#define INIT_MM_ASI(init_mm) \ + .asi = { \ + [0] = { \ + .pgd = asi_global_nonsensitive_pgd, \ + .mm = &init_mm \ + } \ + }, + static inline pgd_t *asi_pgd(struct asi *asi) { return asi->pgd; diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index 84d220cbdcfc..d381ae573af9 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -1,11 +1,13 @@ // SPDX-License-Identifier: GPL-2.0 #include +#include #include #include #include +#include "mm_internal.h" #include "../../../mm/internal.h" #undef pr_fmt @@ -17,6 +19,8 @@ static DEFINE_SPINLOCK(asi_class_lock); DEFINE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state); EXPORT_PER_CPU_SYMBOL_GPL(asi_cpu_state); +__aligned(PAGE_SIZE) pgd_t asi_global_nonsensitive_pgd[PTRS_PER_PGD]; + int asi_register_class(const char *name, uint flags, const struct asi_hooks *ops) { @@ -160,12 +164,17 @@ static void asi_free_pgd_range(struct asi *asi, uint start, uint end) * Free the page tables allocated for the given ASI instance. * The caller must ensure that all the mappings have already been cleared * and appropriate TLB flushes have been issued before calling this function. + * + * For standard non-sensitive ASI classes, the page tables shared with the + * master pseudo-PGD are not freed. 
*/ static void asi_free_pgd(struct asi *asi) { VM_BUG_ON(asi->mm == &init_mm); - asi_free_pgd_range(asi, KERNEL_PGD_BOUNDARY, PTRS_PER_PGD); + if (!(asi->class->flags & ASI_MAP_STANDARD_NONSENSITIVE)) + asi_free_pgd_range(asi, KERNEL_PGD_BOUNDARY, PTRS_PER_PGD); + free_pages((ulong)asi->pgd, PGD_ALLOCATION_ORDER); } @@ -178,6 +187,24 @@ static int __init set_asi_param(char *str) } early_param("asi", set_asi_param); +static int __init asi_global_init(void) +{ + if (!boot_cpu_has(X86_FEATURE_ASI)) + return 0; + + preallocate_toplevel_pgtbls(asi_global_nonsensitive_pgd, + PAGE_OFFSET, + PAGE_OFFSET + PFN_PHYS(max_possible_pfn) - 1, + "ASI Global Non-sensitive direct map"); + + preallocate_toplevel_pgtbls(asi_global_nonsensitive_pgd, + VMALLOC_START, VMALLOC_END, + "ASI Global Non-sensitive vmalloc"); + + return 0; +} +subsys_initcall(asi_global_init) + int asi_init(struct mm_struct *mm, int asi_index) { struct asi *asi = &mm->asi[asi_index]; @@ -202,6 +229,13 @@ int asi_init(struct mm_struct *mm, int asi_index) asi->class = &asi_class[asi_index]; asi->mm = mm; + if (asi->class->flags & ASI_MAP_STANDARD_NONSENSITIVE) { + uint i; + + for (i = KERNEL_PGD_BOUNDARY; i < PTRS_PER_PGD; i++) + set_pgd(asi->pgd + i, asi_global_nonsensitive_pgd[i]); + } + return 0; } EXPORT_SYMBOL_GPL(asi_init); diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 36098226a957..ebd512c64ed0 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1277,18 +1277,15 @@ static void __init register_page_bootmem_info(void) #endif } -/* - * Pre-allocates page-table pages for the vmalloc area in the kernel page-table. - * Only the level which needs to be synchronized between all page-tables is - * allocated because the synchronization can be expensive. - */ -static void __init preallocate_vmalloc_pages(void) +void __init preallocate_toplevel_pgtbls(pgd_t *pgd_table, + ulong start, ulong end, + const char *name) { unsigned long addr; const char *lvl; - for (addr = VMALLOC_START; addr <= VMALLOC_END; addr = ALIGN(addr + 1, PGDIR_SIZE)) { - pgd_t *pgd = pgd_offset_k(addr); + for (addr = start; addr <= end; addr = ALIGN(addr + 1, PGDIR_SIZE)) { + pgd_t *pgd = pgd_offset_pgd(pgd_table, addr); p4d_t *p4d; pud_t *pud; @@ -1324,7 +1321,18 @@ static void __init preallocate_vmalloc_pages(void) * The pages have to be there now or they will be missing in * process page-tables later. */ - panic("Failed to pre-allocate %s pages for vmalloc area\n", lvl); + panic("Failed to pre-allocate %s pages for %s area\n", lvl, name); +} + +/* + * Pre-allocates page-table pages for the vmalloc area in the kernel page-table. + * Only the level which needs to be synchronized between all page-tables is + * allocated because the synchronization can be expensive. 
+ */ +static void __init preallocate_vmalloc_pages(void) +{ + preallocate_toplevel_pgtbls(init_mm.pgd, VMALLOC_START, VMALLOC_END, + "vmalloc"); } void __init mem_init(void) diff --git a/arch/x86/mm/mm_internal.h b/arch/x86/mm/mm_internal.h index 3f37b5c80bb3..a1e8c523ab08 100644 --- a/arch/x86/mm/mm_internal.h +++ b/arch/x86/mm/mm_internal.h @@ -19,6 +19,9 @@ unsigned long kernel_physical_mapping_change(unsigned long start, unsigned long page_size_mask); void zone_sizes_init(void); +void preallocate_toplevel_pgtbls(pgd_t *pgd_table, ulong start, ulong end, + const char *name); + extern int after_bootmem; void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache); diff --git a/include/asm-generic/asi.h b/include/asm-generic/asi.h index 7da91cbe075d..012691e29895 100644 --- a/include/asm-generic/asi.h +++ b/include/asm-generic/asi.h @@ -12,6 +12,8 @@ #define ASI_MAX_NUM_ORDER 0 #define ASI_MAX_NUM 0 +#define ASI_GLOBAL_NONSENSITIVE NULL + #ifndef _ASSEMBLY_ struct asi_hooks {}; @@ -63,8 +65,11 @@ void asi_unmap(struct asi *asi, void *addr, size_t len, bool flush_tlb) { } static inline void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len) { } +#define INIT_MM_ASI(init_mm) + #define static_asi_enabled() false + #endif /* !_ASSEMBLY_ */ #endif /* !CONFIG_ADDRESS_SPACE_ISOLATION */ diff --git a/mm/init-mm.c b/mm/init-mm.c index b4a6f38fb51d..47a6a66610fb 100644 --- a/mm/init-mm.c +++ b/mm/init-mm.c @@ -11,6 +11,7 @@ #include #include #include +#include #ifndef INIT_MM_CONTEXT #define INIT_MM_CONTEXT(name) @@ -38,6 +39,7 @@ struct mm_struct init_mm = { .mmlist = LIST_HEAD_INIT(init_mm.mmlist), .user_ns = &init_user_ns, .cpu_bitmap = CPU_BITS_NONE, + INIT_MM_ASI(init_mm) INIT_MM_CONTEXT(init_mm) }; From patchwork Wed Feb 23 05:21:45 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756415 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8AFA5C4332F for ; Wed, 23 Feb 2022 05:24:34 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238243AbiBWFY6 (ORCPT ); Wed, 23 Feb 2022 00:24:58 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57668 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238180AbiBWFYx (ORCPT ); Wed, 23 Feb 2022 00:24:53 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A8C086A076 for ; Tue, 22 Feb 2022 21:24:06 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id b64-20020a256743000000b0061e169a5f19so26554460ybc.11 for ; Tue, 22 Feb 2022 21:24:06 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=A2kZT9A4P8bW0lI9YOtf1wQA9jJkQ1koWvhOh//ocUo=; b=fbSsueZGuBScgzqcY02VIl0ILFLXpPlKdSo5ctJYl8kSPI7w0kDZ1uGY8ujtC+fl+e PEzsS5j+6h2lyRr/UrTBatrXvU4f3NErvUAR7v4X1dzKu33iDdHE2QCAop5cuA3atf5U EJ4V1Fx706MmP14iSjNuJdP3U5u2B16C5Fj3e/UoRc8fZC4srXBJUArKZJ67Qxmu0hNA BsPOvLUKtyHj9nuWAemSw1G1+wDLXUYyYOFHHc5fb0JSBY7PBPNYf4tS05DMNxCxy6ed 1VKdWkC7HesYHDOsvNnF5BnhxEmTplkx92VqdXNvb7r03sGqA/SDTfz8vVfIQeADD5L1 ir/Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; 
d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=A2kZT9A4P8bW0lI9YOtf1wQA9jJkQ1koWvhOh//ocUo=; b=k4RxFUAX7WxWvfJ8lAFn/Ol+NpBr/KR2h3AlyQ+Uqn2Tq2PJULT3JAWbU2W85M6PNr TCZrPDgbGv+GMqX6STVZ5KnGlT41556pP0Nym8Ztam9XbrlVQ6duiSyIvLfHUQL3+9Ls 8HGlGxP6Q4pPVC2h1BGfH3eKCsBqBBWXEMKHRWKCYIKPzUFg+W4CPTkvAQFjowecG3bZ RRzipbWuwcaXNjBb/BHAkKgQIh3zdgsOV7SBCr73Db3qRF8CtIYRs9YVVkH4dt8JgKDc pi8U6F1rZuafeXmTl82M3QAc7xInOCEOoPyGU4EATPyru+mDLWqUmnfsRqLu4bq17EIp rpCw== X-Gm-Message-State: AOAM533LdPR4oD0pvH/2xr+G4D4LHfJcq+U9282mlKtuTWVIU0jD86+9 ZxWrxwGoTD5uS4qCjA9dNFPqhTlPHmM0 X-Google-Smtp-Source: ABdhPJx1fLXJ2Nd+XyAyj3SwtD/6yx9YWclO/RpwDKJJBLCMeXbY/FNMqCpTASogEOd+LsJ96uLWXmi1HY+9 X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a81:1d8c:0:b0:2cb:da76:5da8 with SMTP id d134-20020a811d8c000000b002cbda765da8mr27707177ywd.165.1645593845809; Tue, 22 Feb 2022 21:24:05 -0800 (PST) Date: Tue, 22 Feb 2022 21:21:45 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-10-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 09/47] mm: Add __PAGEFLAG_FALSE From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org __PAGEFLAG_FALSE is a non-atomic equivalent of PAGEFLAG_FALSE. 
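[Editorial illustration, not part of the patch: patch 10 of this series uses the new macro so that the ASI page flag and its non-atomic accessors compile away when CONFIG_ADDRESS_SPACE_ISOLATION is disabled, roughly as follows.]

#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
__PAGEFLAG(GlobalNonSensitive, global_nonsensitive, PF_ANY);
#else
__PAGEFLAG_FALSE(GlobalNonSensitive, global_nonsensitive);
#endif

/*
 * With the config off, __SetPageGlobalNonSensitive() and
 * __ClearPageGlobalNonSensitive() become empty inline functions and
 * PageGlobalNonSensitive() is constant false, so callers need no #ifdefs.
 */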
Signed-off-by: Junaid Shahid --- include/linux/page-flags.h | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index b5f14d581113..b90a17e9796d 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -390,6 +390,10 @@ static inline int Page##uname(const struct page *page) { return 0; } static inline void folio_set_##lname(struct folio *folio) { } \ static inline void SetPage##uname(struct page *page) { } +#define __SETPAGEFLAG_NOOP(uname, lname) \ +static inline void __folio_set_##lname(struct folio *folio) { } \ +static inline void __SetPage##uname(struct page *page) { } + #define CLEARPAGEFLAG_NOOP(uname, lname) \ static inline void folio_clear_##lname(struct folio *folio) { } \ static inline void ClearPage##uname(struct page *page) { } @@ -411,6 +415,9 @@ static inline int TestClearPage##uname(struct page *page) { return 0; } #define PAGEFLAG_FALSE(uname, lname) TESTPAGEFLAG_FALSE(uname, lname) \ SETPAGEFLAG_NOOP(uname, lname) CLEARPAGEFLAG_NOOP(uname, lname) +#define __PAGEFLAG_FALSE(uname, lname) TESTPAGEFLAG_FALSE(uname, lname) \ + __SETPAGEFLAG_NOOP(uname, lname) __CLEARPAGEFLAG_NOOP(uname, lname) + #define TESTSCFLAG_FALSE(uname, lname) \ TESTSETFLAG_FALSE(uname, lname) TESTCLEARFLAG_FALSE(uname, lname) From patchwork Wed Feb 23 05:21:46 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756416 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id DD50CC433FE for ; Wed, 23 Feb 2022 05:24:42 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238298AbiBWFZD (ORCPT ); Wed, 23 Feb 2022 00:25:03 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57746 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238207AbiBWFYz (ORCPT ); Wed, 23 Feb 2022 00:24:55 -0500 Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 090EF6C1D9 for ; Tue, 22 Feb 2022 21:24:08 -0800 (PST) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-2d07ae11464so162253827b3.14 for ; Tue, 22 Feb 2022 21:24:08 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=gMgbv5PxdERvilfG13W1/59w3Y4MPufgzH9ipT5mZcE=; b=b6urhxkWGfnvarQIxVi9atsKIrKLwBXqKa9HpQGtESTNRqGIm9vkGf3iqdrTrF3WZR 5m1o2lHCH4V6NfBZsxlk8GhGN3fJPzlR3rYLD19suBuPDuixp1NPgPg6U6hpUzZvTY5L IgBE58eLo2ZZzbqp5xEwBHiVFVtDJ87nrKLza6quhghsvVIu54Q0IHEcJ3cNhC/sxTwh jZftmgJSZ3zhdFt219bc3VfY34rKbPUl3RGNBdUPaeK8YOXIcRYVBQ85kA7yXhGQqR5+ DMF/TwW/QA+HqXhMMcxPWsADaTGbuLYcULrIcRh5s/jS84XHXJ0eFx/xBfm7XL6hFrwa +MKQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=gMgbv5PxdERvilfG13W1/59w3Y4MPufgzH9ipT5mZcE=; b=Wave+P5sIlJuS6nA+lL6uV56I3kj/3Ds3FbIJlj5CLZ3lWbqV5B9/DuFKrTjEvZMnh g/ooRlEEsbMC86xsXtgJD+iB7A0KsRFyAYOcvSRTtGrq24NAhochRiXwNXPOPeOZvfiB drwxEXky3NNrHh1xSG9GDBeUdnpUgGIah+WAzPjfBA9GaNuLFcYULDs35vSpg8BxmHR/ gcGgmH401jEGCWTrw1dS/uYJmJ/jKKVohebYfp4eRetz6s2nluFe6KDVytshiM1jzWWR 
uiteSznU9fTwUPrDtVWIG3ArRIV5yqoax2TsXGhChbVdbREAC0sLlhjCV27hFW6ngsYk 9qVA== X-Gm-Message-State: AOAM531xcyGO+vnZ011PzXb8FpjoL4u1gxfTF5h7ZKg/aVj7eeRx8VBC NZbmqAmJ+bzAuOsJjhfXq4X8EoEPnnac X-Google-Smtp-Source: ABdhPJxT/vWX0OjiErFxevki8JUky5j473dKCodPTix92VCbsXfh3r9zdc2C7QnCBcd9ntaYhKL50joC8J3g X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a81:5d0:0:b0:2d0:d056:c703 with SMTP id 199-20020a8105d0000000b002d0d056c703mr28080196ywf.288.1645593847896; Tue, 22 Feb 2022 21:24:07 -0800 (PST) Date: Tue, 22 Feb 2022 21:21:46 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-11-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 10/47] mm: asi: Support for global non-sensitive direct map allocations From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org A new GFP flag is added to specify that an allocation should be considered globally non-sensitive. The pages will be mapped into the ASI global non-sensitive pseudo-PGD, which is shared between all standard ASI instances. A new page flag is also added so that when these pages are freed, they can also be unmapped from the ASI page tables. Signed-off-by: Junaid Shahid --- include/linux/gfp.h | 10 ++- include/linux/mm_types.h | 5 ++ include/linux/page-flags.h | 9 ++ include/trace/events/mmflags.h | 12 ++- mm/page_alloc.c | 145 ++++++++++++++++++++++++++++++++- tools/perf/builtin-kmem.c | 1 + 6 files changed, 178 insertions(+), 4 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 8fcc38467af6..07a99a463a34 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -60,6 +60,11 @@ struct vm_area_struct; #else #define ___GFP_NOLOCKDEP 0 #endif +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +#define ___GFP_GLOBAL_NONSENSITIVE 0x4000000u +#else +#define ___GFP_GLOBAL_NONSENSITIVE 0 +#endif /* If the above are modified, __GFP_BITS_SHIFT may need updating */ /* @@ -248,8 +253,11 @@ struct vm_area_struct; /* Disable lockdep for GFP context tracking */ #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP) +/* Allocate non-sensitive memory */ +#define __GFP_GLOBAL_NONSENSITIVE ((__force gfp_t)___GFP_GLOBAL_NONSENSITIVE) + /* Room for N __GFP_FOO bits */ -#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP)) +#define __GFP_BITS_SHIFT 27 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) /** diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 3de1afa57289..5b8028fcfe67 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -191,6 +191,11 @@ struct page { /** @rcu_head: You can use this to free a page by RCU. */ struct rcu_head rcu_head; + +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + /* Links the pages_to_free_async list */ + struct llist_node async_free_node; +#endif }; union { /* This union is 4 bytes in size. 
*/ diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index b90a17e9796d..a07434cc679c 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -140,6 +140,9 @@ enum pageflags { #endif #ifdef CONFIG_KASAN_HW_TAGS PG_skip_kasan_poison, +#endif +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + PG_global_nonsensitive, #endif __NR_PAGEFLAGS, @@ -542,6 +545,12 @@ TESTCLEARFLAG(Young, young, PF_ANY) PAGEFLAG(Idle, idle, PF_ANY) #endif +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +__PAGEFLAG(GlobalNonSensitive, global_nonsensitive, PF_ANY); +#else +__PAGEFLAG_FALSE(GlobalNonSensitive, global_nonsensitive); +#endif + #ifdef CONFIG_KASAN_HW_TAGS PAGEFLAG(SkipKASanPoison, skip_kasan_poison, PF_HEAD) #else diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index 116ed4d5d0f8..73a49197ef54 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -50,7 +50,8 @@ {(unsigned long)__GFP_DIRECT_RECLAIM, "__GFP_DIRECT_RECLAIM"},\ {(unsigned long)__GFP_KSWAPD_RECLAIM, "__GFP_KSWAPD_RECLAIM"},\ {(unsigned long)__GFP_ZEROTAGS, "__GFP_ZEROTAGS"}, \ - {(unsigned long)__GFP_SKIP_KASAN_POISON,"__GFP_SKIP_KASAN_POISON"}\ + {(unsigned long)__GFP_SKIP_KASAN_POISON,"__GFP_SKIP_KASAN_POISON"},\ + {(unsigned long)__GFP_GLOBAL_NONSENSITIVE, "__GFP_GLOBAL_NONSENSITIVE"}\ #define show_gfp_flags(flags) \ (flags) ? __print_flags(flags, "|", \ @@ -93,6 +94,12 @@ #define IF_HAVE_PG_SKIP_KASAN_POISON(flag,string) #endif +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +#define IF_HAVE_ASI(flag, string) ,{1UL << flag, string} +#else +#define IF_HAVE_ASI(flag, string) +#endif + #define __def_pageflag_names \ {1UL << PG_locked, "locked" }, \ {1UL << PG_waiters, "waiters" }, \ @@ -121,7 +128,8 @@ IF_HAVE_PG_HWPOISON(PG_hwpoison, "hwpoison" ) \ IF_HAVE_PG_IDLE(PG_young, "young" ) \ IF_HAVE_PG_IDLE(PG_idle, "idle" ) \ IF_HAVE_PG_ARCH_2(PG_arch_2, "arch_2" ) \ -IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison") +IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison") \ +IF_HAVE_ASI(PG_global_nonsensitive, "global_nonsensitive") #define show_page_flags(flags) \ (flags) ? __print_flags(flags, "|", \ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c5952749ad40..a4048fa1868a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -697,7 +697,7 @@ static inline bool pcp_allowed_order(unsigned int order) return false; } -static inline void free_the_page(struct page *page, unsigned int order) +static inline void __free_the_page(struct page *page, unsigned int order) { if (pcp_allowed_order(order)) /* Via pcp? */ free_unref_page(page, order); @@ -705,6 +705,14 @@ static inline void free_the_page(struct page *page, unsigned int order) __free_pages_ok(page, order, FPI_NONE); } +static bool asi_unmap_freed_pages(struct page *page, unsigned int order); + +static inline void free_the_page(struct page *page, unsigned int order) +{ + if (asi_unmap_freed_pages(page, order)) + __free_the_page(page, order); +} + /* * Higher-order pages are called "compound pages". 
They are structured thusly: * @@ -5162,6 +5170,129 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order, return true; } +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + +static DEFINE_PER_CPU(struct work_struct, async_free_work); +static DEFINE_PER_CPU(struct llist_head, pages_to_free_async); +static bool async_free_work_initialized; + +static void __free_the_page(struct page *page, unsigned int order); + +static void async_free_work_fn(struct work_struct *work) +{ + struct page *page, *tmp; + struct llist_node *pages_to_free; + void *va; + size_t len; + uint order; + + pages_to_free = llist_del_all(this_cpu_ptr(&pages_to_free_async)); + + /* A later patch will do a more optimized TLB flush. */ + + llist_for_each_entry_safe(page, tmp, pages_to_free, async_free_node) { + va = page_to_virt(page); + order = page->private; + len = PAGE_SIZE * (1 << order); + + asi_flush_tlb_range(ASI_GLOBAL_NONSENSITIVE, va, len); + __free_the_page(page, order); + } +} + +static int __init asi_page_alloc_init(void) +{ + int cpu; + + if (!static_asi_enabled()) + return 0; + + for_each_possible_cpu(cpu) + INIT_WORK(per_cpu_ptr(&async_free_work, cpu), + async_free_work_fn); + + /* + * This function is called before SMP is initialized, so we can assume + * that this is the only running CPU at this point. + */ + + barrier(); + async_free_work_initialized = true; + barrier(); + + if (!llist_empty(this_cpu_ptr(&pages_to_free_async))) + queue_work_on(smp_processor_id(), mm_percpu_wq, + this_cpu_ptr(&async_free_work)); + + return 0; +} +early_initcall(asi_page_alloc_init); + +static int asi_map_alloced_pages(struct page *page, uint order, gfp_t gfp_mask) +{ + uint i; + + if (!static_asi_enabled()) + return 0; + + if (gfp_mask & __GFP_GLOBAL_NONSENSITIVE) { + for (i = 0; i < (1 << order); i++) + __SetPageGlobalNonSensitive(page + i); + + return asi_map_gfp(ASI_GLOBAL_NONSENSITIVE, page_to_virt(page), + PAGE_SIZE * (1 << order), gfp_mask); + } + + return 0; +} + +static bool asi_unmap_freed_pages(struct page *page, unsigned int order) +{ + void *va; + size_t len; + bool async_flush_needed; + + if (!static_asi_enabled()) + return true; + + if (!PageGlobalNonSensitive(page)) + return true; + + va = page_to_virt(page); + len = PAGE_SIZE * (1 << order); + async_flush_needed = irqs_disabled() || in_interrupt(); + + asi_unmap(ASI_GLOBAL_NONSENSITIVE, va, len, !async_flush_needed); + + if (!async_flush_needed) + return true; + + page->private = order; + llist_add(&page->async_free_node, this_cpu_ptr(&pages_to_free_async)); + + if (async_free_work_initialized) + queue_work_on(smp_processor_id(), mm_percpu_wq, + this_cpu_ptr(&async_free_work)); + + return false; +} + +#else /* CONFIG_ADDRESS_SPACE_ISOLATION */ + +static inline +int asi_map_alloced_pages(struct page *pages, uint order, gfp_t gfp_mask) +{ + return 0; +} + +static inline +bool asi_unmap_freed_pages(struct page *page, unsigned int order) +{ + return true; +} + +#endif + /* * __alloc_pages_bulk - Allocate a number of order-0 pages to a list or array * @gfp: GFP flags for the allocation @@ -5345,6 +5476,9 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid, return NULL; } + if (static_asi_enabled() && (gfp & __GFP_GLOBAL_NONSENSITIVE)) + gfp |= __GFP_ZERO; + gfp &= gfp_allowed_mask; /* * Apply scoped allocation constraints. 
This is mainly about GFP_NOFS @@ -5388,6 +5522,15 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid, page = NULL; } + if (page) { + int err = asi_map_alloced_pages(page, order, gfp); + + if (unlikely(err)) { + __free_pages(page, order); + page = NULL; + } + } + trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype); return page; diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c index da03a341c63c..5857953cd5c1 100644 --- a/tools/perf/builtin-kmem.c +++ b/tools/perf/builtin-kmem.c @@ -660,6 +660,7 @@ static const struct { { "__GFP_RECLAIM", "R" }, { "__GFP_DIRECT_RECLAIM", "DR" }, { "__GFP_KSWAPD_RECLAIM", "KR" }, + { "__GFP_GLOBAL_NONSENSITIVE", "GNS" }, }; static size_t max_gfp_len; From patchwork Wed Feb 23 05:21:47 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756418 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id BD639C433F5 for ; Wed, 23 Feb 2022 05:24:50 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238208AbiBWFZP (ORCPT ); Wed, 23 Feb 2022 00:25:15 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56900 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238224AbiBWFY4 (ORCPT ); Wed, 23 Feb 2022 00:24:56 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1A65D6C930 for ; Tue, 22 Feb 2022 21:24:11 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id b64-20020a256743000000b0061e169a5f19so26554573ybc.11 for ; Tue, 22 Feb 2022 21:24:11 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=ZeYN5P9RWE85rWG989hD+TytaZQUQYUlju5cKer3rg4=; b=ojUJsUEL9fv13WjcZCmN21LxwR3Z6zqiLpmAQSzxCgc5C9RlKzd4dJEPqIdea+S3R3 SjjKai+zXn4VwTi+vC8vw+NDJWHGEkF4oYybv8WUgGBs6X4Bacon5BeJ8KDM6+Kjcqg7 fYXSLYWwuUfF3ZzaHWiW5r5fUjU5hUbi5i8udL9OUGPQxFLT6BZJN8cKiUG2KSidmSdI 7UxdullkH2J71pHhwQMQjuVH7Fyee/6awuNUgdr31148ohQL1xZz1MmOnDGr6y/5AC05 atbUvniVVa19eg7E36/l3gFWXXYduMYcF6MD2CAuUTHeqJzSqvtlhwblL5Fzis+kQ6SQ claQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=ZeYN5P9RWE85rWG989hD+TytaZQUQYUlju5cKer3rg4=; b=fqAJny8XpmNBfKazVcafqF1HaTbizsOAJdRWuOs64pC0M4CbMxaEfl2lG54SGbk4vt oPstX8WflrELWucO9prH8N7Ja8HlWrKxBPNB6L/eJHcNFdqlv5hplZqCdpNyQzKp15zR Olh5as5Mah0l+Cr9+eXtG4qv7p84E9cjX1x2VX17u0dAuHdTPsSqtdb5tujQZndpb/Sg tZarFpGzxYRZOOL+gU8ZPvEEurlolIxg4fmIzFrF500tVR3cF6oYPfmiRDeSmdYcQIjw QcGvTk/vMrnflsX6356U7p+d+hthoVtI/WPAjZoB1QtxZO2r042aK6l/Afys/p/bE9+W xxVg== X-Gm-Message-State: AOAM532DugsLKDBRr0aEI0VDjDFTAXetQnQSAEKUjI9+dpNzguLCNFua D9a8Sm30/VKdXses3it/JrErFN86iiw4 X-Google-Smtp-Source: ABdhPJx7Y4lP1+DVrEJyeaCPqdjSHoYWrsMiEOQNVQ/p8prXsXPbox8dmisMikykjaQnA+B25uCbwTFY5Vh8 X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a5b:589:0:b0:61d:de51:9720 with SMTP id l9-20020a5b0589000000b0061dde519720mr26317731ybp.167.1645593850281; Tue, 22 Feb 2022 21:24:10 -0800 (PST) Date: 
Tue, 22 Feb 2022 21:21:47 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-12-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 11/47] mm: asi: Global non-sensitive vmalloc/vmap support From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org A new flag, VM_GLOBAL_NONSENSITIVE is added to designate globally non-sensitive vmalloc/vmap areas. When using the __vmalloc / __vmalloc_node APIs, if the corresponding GFP flag is specified, the VM flag is automatically added. When using the __vmalloc_node_range API, either flag can be specified independently. The VM flag will only map the vmalloc area as non-sensitive, while the GFP flag will only map the underlying direct map area as non-sensitive. When using the __vmalloc_node_range API, instead of VMALLOC_START/END, VMALLOC_GLOBAL_NONSENSITIVE_START/END should be used. This is to keep these mappings separate from locally non-sensitive vmalloc areas, which will be added later. Areas outside of the standard vmalloc range can specify the range as before. Signed-off-by: Junaid Shahid --- arch/x86/include/asm/pgtable_64_types.h | 5 +++ arch/x86/mm/asi.c | 3 +- include/asm-generic/asi.h | 3 ++ include/linux/vmalloc.h | 6 +++ mm/vmalloc.c | 53 ++++++++++++++++++++++--- 5 files changed, 64 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h index 91ac10654570..0fc380ba25b8 100644 --- a/arch/x86/include/asm/pgtable_64_types.h +++ b/arch/x86/include/asm/pgtable_64_types.h @@ -141,6 +141,11 @@ extern unsigned int ptrs_per_p4d; #define VMALLOC_END (VMALLOC_START + (VMALLOC_SIZE_TB << 40) - 1) +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +#define VMALLOC_GLOBAL_NONSENSITIVE_START VMALLOC_START +#define VMALLOC_GLOBAL_NONSENSITIVE_END VMALLOC_END +#endif + #define MODULES_VADDR (__START_KERNEL_map + KERNEL_IMAGE_SIZE) /* The module sections ends with the start of the fixmap */ #ifndef CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index d381ae573af9..71348399baf1 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -198,7 +198,8 @@ static int __init asi_global_init(void) "ASI Global Non-sensitive direct map"); preallocate_toplevel_pgtbls(asi_global_nonsensitive_pgd, - VMALLOC_START, VMALLOC_END, + VMALLOC_GLOBAL_NONSENSITIVE_START, + VMALLOC_GLOBAL_NONSENSITIVE_END, "ASI Global Non-sensitive vmalloc"); return 0; diff --git a/include/asm-generic/asi.h b/include/asm-generic/asi.h index 012691e29895..f918cd052722 100644 --- a/include/asm-generic/asi.h +++ b/include/asm-generic/asi.h @@ -14,6 +14,9 @@ #define ASI_GLOBAL_NONSENSITIVE NULL +#define VMALLOC_GLOBAL_NONSENSITIVE_START VMALLOC_START +#define VMALLOC_GLOBAL_NONSENSITIVE_END VMALLOC_END + #ifndef _ASSEMBLY_ struct asi_hooks {}; diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 6e022cc712e6..c7c66decda3e 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -39,6 +39,12 @@ struct notifier_block; /* in notifier.h */ * determine which allocations need the module shadow freed. 
*/ +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +#define VM_GLOBAL_NONSENSITIVE 0x00000800 /* Similar to __GFP_GLOBAL_NONSENSITIVE */ +#else +#define VM_GLOBAL_NONSENSITIVE 0 +#endif + /* bits [20..32] reserved for arch specific ioremap internals */ /* diff --git a/mm/vmalloc.c b/mm/vmalloc.c index f2ef719f1cba..ba588a37ee75 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2393,6 +2393,33 @@ void __init vmalloc_init(void) vmap_initialized = true; } +static int asi_map_vm_area(struct vm_struct *area) +{ + if (!static_asi_enabled()) + return 0; + + if (area->flags & VM_GLOBAL_NONSENSITIVE) + return asi_map(ASI_GLOBAL_NONSENSITIVE, area->addr, + get_vm_area_size(area)); + + return 0; +} + +static void asi_unmap_vm_area(struct vm_struct *area) +{ + if (!static_asi_enabled()) + return; + + /* + * TODO: The TLB flush here could potentially be avoided in + * the case when the existing flush from try_purge_vmap_area_lazy() + * and/or vm_unmap_aliases() happens non-lazily. + */ + if (area->flags & VM_GLOBAL_NONSENSITIVE) + asi_unmap(ASI_GLOBAL_NONSENSITIVE, area->addr, + get_vm_area_size(area), true); +} + static inline void setup_vmalloc_vm_locked(struct vm_struct *vm, struct vmap_area *va, unsigned long flags, const void *caller) { @@ -2570,6 +2597,7 @@ static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages) int flush_dmap = 0; int i; + asi_unmap_vm_area(area); remove_vm_area(area->addr); /* If this is not VM_FLUSH_RESET_PERMS memory, no need for the below. */ @@ -2787,16 +2815,20 @@ void *vmap(struct page **pages, unsigned int count, addr = (unsigned long)area->addr; if (vmap_pages_range(addr, addr + size, pgprot_nx(prot), - pages, PAGE_SHIFT) < 0) { - vunmap(area->addr); - return NULL; - } + pages, PAGE_SHIFT) < 0) + goto err; + + if (asi_map_vm_area(area)) + goto err; if (flags & VM_MAP_PUT_PAGES) { area->pages = pages; area->nr_pages = count; } return area->addr; +err: + vunmap(area->addr); + return NULL; } EXPORT_SYMBOL(vmap); @@ -2991,6 +3023,9 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, goto fail; } + if (asi_map_vm_area(area)) + goto fail; + return area->addr; fail: @@ -3038,6 +3073,9 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, if (WARN_ON_ONCE(!size)) return NULL; + if (static_asi_enabled() && (vm_flags & VM_GLOBAL_NONSENSITIVE)) + gfp_mask |= __GFP_ZERO; + if ((size >> PAGE_SHIFT) > totalram_pages()) { warn_alloc(gfp_mask, NULL, "vmalloc error: size %lu, exceeds total pages", @@ -3127,8 +3165,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask, int node, const void *caller) { + ulong vm_flags = 0; + + if (static_asi_enabled() && (gfp_mask & __GFP_GLOBAL_NONSENSITIVE)) + vm_flags |= VM_GLOBAL_NONSENSITIVE; + return __vmalloc_node_range(size, align, VMALLOC_START, VMALLOC_END, - gfp_mask, PAGE_KERNEL, 0, node, caller); + gfp_mask, PAGE_KERNEL, vm_flags, node, caller); } /* * This is only for performance analysis of vmalloc and stress purpose. 
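[Editorial sketch, not part of the series: how a caller might use the globally non-sensitive allocation interfaces added in the preceding patches. The four-page size and the variable names are made up for illustration.]

/* Direct-map pages, also mapped into the ASI global non-sensitive page tables: */
struct page *page = alloc_pages(GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE, 0);

/*
 * vmalloc memory; __vmalloc() adds VM_GLOBAL_NONSENSITIVE automatically
 * when the GFP flag is passed:
 */
void *buf = __vmalloc(4 * PAGE_SIZE, GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);

/* Explicit control via __vmalloc_node_range(), using the dedicated range: */
void *buf2 = __vmalloc_node_range(4 * PAGE_SIZE, 1,
                                  VMALLOC_GLOBAL_NONSENSITIVE_START,
                                  VMALLOC_GLOBAL_NONSENSITIVE_END,
                                  GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE,
                                  PAGE_KERNEL, VM_GLOBAL_NONSENSITIVE,
                                  NUMA_NO_NODE, __builtin_return_address(0));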
From patchwork Wed Feb 23 05:21:48 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756417 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id F0D71C433F5 for ; Wed, 23 Feb 2022 05:24:47 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235332AbiBWFZL (ORCPT ); Wed, 23 Feb 2022 00:25:11 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56886 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238249AbiBWFY7 (ORCPT ); Wed, 23 Feb 2022 00:24:59 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9D86D6CA42 for ; Tue, 22 Feb 2022 21:24:13 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id b12-20020a056902030c00b0061d720e274aso26586376ybs.20 for ; Tue, 22 Feb 2022 21:24:13 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=h48HgrV9Ms87ueeFjUPVjN/Yw+KwlPL1CcnJb5APGwE=; b=Hmt0w/kiWsp/YP+equQ01AdLjZMh/o/2Z8oEGQ6Lq4Nt157PlHz1XpRl+A7GPQXL2i FRSX/mXn548mQM/TC1p/MGu32EFPYWhSLwU4OefqIbwBe+iA80Pue4vU6e/ZB7nsR4Y4 523kigSioD12IuJ5T0licYK9j98yC1QXVRYs3anUzXJCRNXAH9qqii0hyNma152NX9Ps ggEjc8Pz3upnBint0xIR6P2QSeIhMzI8K84X5fX/AWfAvKn+5L2W9liNKxXaSWKaCFma +45lbJ2fC4qur3JJ5wHMSfnoIjceTnzqkA7atH2Cu3t0CWoRI/zCfcvLHfSjM3PIpuRT HILQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=h48HgrV9Ms87ueeFjUPVjN/Yw+KwlPL1CcnJb5APGwE=; b=0MVL3NRRie9rA60wOkE3otvnTZtauFn6dysMHOEC5w32LY0cFj8nCGQ7k35Y+Z2ON+ 91/eGS6KNjDrA+oa5550b4gyTiO6pAHscgSx+MYfRWQMVVQfACjQRc/JOTBViUsJ6tKF Gd2Z7NsO9chz+IoBomYSIF/az/rv6wPT2ItSIccMXfeWwsrMyMnGA6aCUF7W2Xcueas4 mF+A+zYWWC2Bw3RBCgdqM2UX0eMhVTbzw9SdrJDuBSBCVVrYBpO3f5lJhbuy35cluSIf cWisW7ZFrF93nGRKgYlmB62W/gNR4v1/28GVteOpVTdzGmn5KCR9VL2yqmTmV++wjeL5 hMXA== X-Gm-Message-State: AOAM5331qLrCv6WZZelktrB2UNOYyskaFVqS0TT3XIgxYzRZ+vqq3ZGi vqXtE+GRoGDm2remWvqURcwv5CSEDQpW X-Google-Smtp-Source: ABdhPJyfC5FSpMaAtl/McL2SxzvvGsqtTFjrUAQSrC01vyuUjm1vV+WNWVvhncOg+7K7oU2DAPMbItihfugH X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a81:1c47:0:b0:2d7:5822:1739 with SMTP id c68-20020a811c47000000b002d758221739mr11411035ywc.502.1645593852744; Tue, 22 Feb 2022 21:24:12 -0800 (PST) Date: Tue, 22 Feb 2022 21:21:48 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-13-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 12/47] mm: asi: Support for global non-sensitive slab caches From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org A new 
flag SLAB_GLOBAL_NONSENSITIVE is added, which would designate all objects within that slab cache to be globally non-sensitive. Another flag SLAB_NONSENSITIVE is also added, which is currently just an alias for SLAB_GLOBAL_NONSENSITIVE, but will eventually be used to designate slab caches which can allocate either global or local non-sensitive objects. In addition, new kmalloc caches have been added that can be used to allocate non-sensitive objects. Signed-off-by: Junaid Shahid --- include/linux/slab.h | 32 +++++++++++++++---- mm/slab.c | 5 +++ mm/slab.h | 14 ++++++++- mm/slab_common.c | 73 +++++++++++++++++++++++++++++++++----------- security/Kconfig | 2 +- 5 files changed, 101 insertions(+), 25 deletions(-) diff --git a/include/linux/slab.h b/include/linux/slab.h index 181045148b06..7b8a3853d827 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -120,6 +120,12 @@ /* Slab deactivation flag */ #define SLAB_DEACTIVATED ((slab_flags_t __force)0x10000000U) +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +#define SLAB_GLOBAL_NONSENSITIVE ((slab_flags_t __force)0x20000000U) +#else +#define SLAB_GLOBAL_NONSENSITIVE 0 +#endif + /* * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests. * @@ -329,6 +335,11 @@ enum kmalloc_cache_type { extern struct kmem_cache * kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1]; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +extern struct kmem_cache * +nonsensitive_kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1]; +#endif + /* * Define gfp bits that should not be set for KMALLOC_NORMAL. */ @@ -361,6 +372,17 @@ static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags) return KMALLOC_CGROUP; } +static __always_inline struct kmem_cache *get_kmalloc_cache(gfp_t flags, + uint index) +{ +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + + if (static_asi_enabled() && (flags & __GFP_GLOBAL_NONSENSITIVE)) + return nonsensitive_kmalloc_caches[kmalloc_type(flags)][index]; +#endif + return kmalloc_caches[kmalloc_type(flags)][index]; +} + /* * Figure out which kmalloc slab an allocation of a certain size * belongs to. 
@@ -587,9 +609,8 @@ static __always_inline __alloc_size(1) void *kmalloc(size_t size, gfp_t flags) if (!index) return ZERO_SIZE_PTR; - return kmem_cache_alloc_trace( - kmalloc_caches[kmalloc_type(flags)][index], - flags, size); + return kmem_cache_alloc_trace(get_kmalloc_cache(flags, index), + flags, size); #endif } return __kmalloc(size, flags); @@ -605,9 +626,8 @@ static __always_inline __alloc_size(1) void *kmalloc_node(size_t size, gfp_t fla if (!i) return ZERO_SIZE_PTR; - return kmem_cache_alloc_node_trace( - kmalloc_caches[kmalloc_type(flags)][i], - flags, node, size); + return kmem_cache_alloc_node_trace(get_kmalloc_cache(flags, i), + flags, node, size); } #endif return __kmalloc_node(size, flags, node); diff --git a/mm/slab.c b/mm/slab.c index ca4822f6b2b6..5a928d95d67b 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -1956,6 +1956,9 @@ int __kmem_cache_create(struct kmem_cache *cachep, slab_flags_t flags) size = ALIGN(size, REDZONE_ALIGN); } + if (!static_asi_enabled()) + flags &= ~SLAB_NONSENSITIVE; + /* 3) caller mandated alignment */ if (ralign < cachep->align) { ralign = cachep->align; @@ -2058,6 +2061,8 @@ int __kmem_cache_create(struct kmem_cache *cachep, slab_flags_t flags) cachep->allocflags |= GFP_DMA32; if (flags & SLAB_RECLAIM_ACCOUNT) cachep->allocflags |= __GFP_RECLAIMABLE; + if (flags & SLAB_GLOBAL_NONSENSITIVE) + cachep->allocflags |= __GFP_GLOBAL_NONSENSITIVE; cachep->size = size; cachep->reciprocal_buffer_size = reciprocal_value(size); diff --git a/mm/slab.h b/mm/slab.h index 56ad7eea3ddf..f190f4fc0286 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -77,6 +77,10 @@ extern struct kmem_cache *kmem_cache; /* A table of kmalloc cache names and sizes */ extern const struct kmalloc_info_struct { const char *name[NR_KMALLOC_TYPES]; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + const char *nonsensitive_name[NR_KMALLOC_TYPES]; +#endif + slab_flags_t flags[NR_KMALLOC_TYPES]; unsigned int size; } kmalloc_info[]; @@ -124,11 +128,14 @@ static inline slab_flags_t kmem_cache_flags(unsigned int object_size, } #endif +/* This will also include SLAB_LOCAL_NONSENSITIVE in a later patch. */ +#define SLAB_NONSENSITIVE SLAB_GLOBAL_NONSENSITIVE /* Legal flag mask for kmem_cache_create(), for various configurations */ #define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \ SLAB_CACHE_DMA32 | SLAB_PANIC | \ - SLAB_TYPESAFE_BY_RCU | SLAB_DEBUG_OBJECTS ) + SLAB_TYPESAFE_BY_RCU | SLAB_DEBUG_OBJECTS | \ + SLAB_NONSENSITIVE) #if defined(CONFIG_DEBUG_SLAB) #define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER) @@ -491,6 +498,11 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, might_alloc(flags); + if (static_asi_enabled()) { + VM_BUG_ON(!(s->flags & SLAB_GLOBAL_NONSENSITIVE) && + (flags & __GFP_GLOBAL_NONSENSITIVE)); + } + if (should_failslab(s, flags)) return NULL; diff --git a/mm/slab_common.c b/mm/slab_common.c index e5d080a93009..72dee2494bf8 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -50,7 +50,7 @@ static DECLARE_WORK(slab_caches_to_rcu_destroy_work, SLAB_FAILSLAB | kasan_never_merge()) #define SLAB_MERGE_SAME (SLAB_RECLAIM_ACCOUNT | SLAB_CACHE_DMA | \ - SLAB_CACHE_DMA32 | SLAB_ACCOUNT) + SLAB_CACHE_DMA32 | SLAB_ACCOUNT | SLAB_NONSENSITIVE) /* * Merge control. If this is set then no merging of slab caches will occur. 
@@ -681,6 +681,15 @@ kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1] __ro_after_init = { /* initialization for https://bugs.llvm.org/show_bug.cgi?id=42570 */ }; EXPORT_SYMBOL(kmalloc_caches); +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + +struct kmem_cache * +nonsensitive_kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1] __ro_after_init = +{ /* initialization for https://bugs.llvm.org/show_bug.cgi?id=42570 */ }; +EXPORT_SYMBOL(nonsensitive_kmalloc_caches); + +#endif + /* * Conversion table for small slabs sizes / 8 to the index in the * kmalloc array. This is necessary for slabs < 192 since we have non power @@ -738,25 +747,34 @@ struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags) index = fls(size - 1); } - return kmalloc_caches[kmalloc_type(flags)][index]; + return get_kmalloc_cache(flags, index); } +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +#define __KMALLOC_NAME(type, base_name, sz) \ + .name[type] = base_name "-" #sz, \ + .nonsensitive_name[type] = "ns-" base_name "-" #sz, +#else +#define __KMALLOC_NAME(type, base_name, sz) \ + .name[type] = base_name "-" #sz, +#endif + #ifdef CONFIG_ZONE_DMA -#define KMALLOC_DMA_NAME(sz) .name[KMALLOC_DMA] = "dma-kmalloc-" #sz, +#define KMALLOC_DMA_NAME(sz) __KMALLOC_NAME(KMALLOC_DMA, "dma-kmalloc", sz) #else #define KMALLOC_DMA_NAME(sz) #endif #ifdef CONFIG_MEMCG_KMEM -#define KMALLOC_CGROUP_NAME(sz) .name[KMALLOC_CGROUP] = "kmalloc-cg-" #sz, +#define KMALLOC_CGROUP_NAME(sz) __KMALLOC_NAME(KMALLOC_CGROUP, "kmalloc-cg", sz) #else #define KMALLOC_CGROUP_NAME(sz) #endif #define INIT_KMALLOC_INFO(__size, __short_size) \ { \ - .name[KMALLOC_NORMAL] = "kmalloc-" #__short_size, \ - .name[KMALLOC_RECLAIM] = "kmalloc-rcl-" #__short_size, \ + __KMALLOC_NAME(KMALLOC_NORMAL, "kmalloc", __short_size) \ + __KMALLOC_NAME(KMALLOC_RECLAIM, "kmalloc-rcl", __short_size) \ KMALLOC_CGROUP_NAME(__short_size) \ KMALLOC_DMA_NAME(__short_size) \ .size = __size, \ @@ -846,18 +864,30 @@ void __init setup_kmalloc_cache_index_table(void) static void __init new_kmalloc_cache(int idx, enum kmalloc_cache_type type, slab_flags_t flags) { + struct kmem_cache *(*caches)[KMALLOC_SHIFT_HIGH + 1] = kmalloc_caches; + const char *name = kmalloc_info[idx].name[type]; + +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + + if (flags & SLAB_NONSENSITIVE) { + caches = nonsensitive_kmalloc_caches; + name = kmalloc_info[idx].nonsensitive_name[type]; + } +#endif + if (type == KMALLOC_RECLAIM) { flags |= SLAB_RECLAIM_ACCOUNT; } else if (IS_ENABLED(CONFIG_MEMCG_KMEM) && (type == KMALLOC_CGROUP)) { if (cgroup_memory_nokmem) { - kmalloc_caches[type][idx] = kmalloc_caches[KMALLOC_NORMAL][idx]; + caches[type][idx] = caches[KMALLOC_NORMAL][idx]; return; } flags |= SLAB_ACCOUNT; + } else if (IS_ENABLED(CONFIG_ZONE_DMA) && (type == KMALLOC_DMA)) { + flags |= SLAB_CACHE_DMA; } - kmalloc_caches[type][idx] = create_kmalloc_cache( - kmalloc_info[idx].name[type], + caches[type][idx] = create_kmalloc_cache(name, kmalloc_info[idx].size, flags, 0, kmalloc_info[idx].size); @@ -866,7 +896,7 @@ new_kmalloc_cache(int idx, enum kmalloc_cache_type type, slab_flags_t flags) * KMALLOC_NORMAL caches. 
*/ if (IS_ENABLED(CONFIG_MEMCG_KMEM) && (type == KMALLOC_NORMAL)) - kmalloc_caches[type][idx]->refcount = -1; + caches[type][idx]->refcount = -1; } /* @@ -908,15 +938,24 @@ void __init create_kmalloc_caches(slab_flags_t flags) for (i = 0; i <= KMALLOC_SHIFT_HIGH; i++) { struct kmem_cache *s = kmalloc_caches[KMALLOC_NORMAL][i]; - if (s) { - kmalloc_caches[KMALLOC_DMA][i] = create_kmalloc_cache( - kmalloc_info[i].name[KMALLOC_DMA], - kmalloc_info[i].size, - SLAB_CACHE_DMA | flags, 0, - kmalloc_info[i].size); - } + if (s) + new_kmalloc_cache(i, KMALLOC_DMA, flags); } #endif + /* + * TODO: We may want to make slab allocations without exiting ASI. + * In that case, the cache metadata itself would need to be + * treated as non-sensitive and mapped as such, and we would need to + * do the bootstrap much more carefully. We can do that if we find + * that slab allocations while inside a restricted address space are + * frequent enough to warrant the additional complexity. + */ + if (static_asi_enabled()) + for (type = KMALLOC_NORMAL; type < NR_KMALLOC_TYPES; type++) + for (i = 0; i <= KMALLOC_SHIFT_HIGH; i++) + if (kmalloc_caches[type][i]) + new_kmalloc_cache(i, type, + flags | SLAB_NONSENSITIVE); } #endif /* !CONFIG_SLOB */ diff --git a/security/Kconfig b/security/Kconfig index 21b15ecaf2c1..0a3e49d6a331 100644 --- a/security/Kconfig +++ b/security/Kconfig @@ -68,7 +68,7 @@ config PAGE_TABLE_ISOLATION config ADDRESS_SPACE_ISOLATION bool "Allow code to run with a reduced kernel address space" default n - depends on X86_64 && !UML + depends on X86_64 && !UML && SLAB depends on !PARAVIRT help This feature provides the ability to run some kernel code From patchwork Wed Feb 23 05:21:49 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756419 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 18F4FC433F5 for ; Wed, 23 Feb 2022 05:25:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238196AbiBWFZm (ORCPT ); Wed, 23 Feb 2022 00:25:42 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56740 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238213AbiBWFZC (ORCPT ); Wed, 23 Feb 2022 00:25:02 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AC3A16D3BC for ; Tue, 22 Feb 2022 21:24:22 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id o5-20020a25d705000000b0062499d760easo8076797ybg.7 for ; Tue, 22 Feb 2022 21:24:22 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=J1PJHeAg+5GsFDWdvzoAJvbqIGqngaUErLJYlufPa+k=; b=E8ylcvEY7HObuUi7YReOD7Xi+5ii//8STuZIJO/P8zS7AqqFvr59ONAjhLdHaJFCVd ANMtkHwUkTCW3M7pycyZwbgMNj908KdgtDwrheZQ6geVat7tNVM68QbPQQZm1CRl1l6l Y4ISDUE0Oi8e8zBW/PcuKqI33TayofpiZRwi9zPR+wcWJ0CNMO0p+82XCjaXWKVoLCcG P0m9VyO7WwmFPDExN5wo9Zkywpckb/NizMoYl/Y1S6i3OD/kq/dMFyBf0yRW04D/GG97 5W9at4d8tgmUGtV7LAQny8Hl7A704zcz/QnIbquS3/UrNABcFf6svGwZpVuu40Fb0cN9 xygg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version 
Date: Tue, 22 Feb 2022 21:21:49 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-14-junaids@google.com> References: <20220223052223.1202152-1-junaids@google.com> Subject: [RFC PATCH 13/47] asi: Added ASI memory cgroup flag From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: Ofir Weisse , kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Ofir Weisse Adds a cgroup flag to control if ASI is enabled for processes in that cgroup. Can be set or cleared by writing to the memory.use_asi file in the memory cgroup. The flag only affects new processes created after the flag was set. In addition to the cgroup flag, we may also want to add a per-process flag, though it will have to be something that can be set at process creation time. Signed-off-by: Ofir Weisse Co-developed-by: Junaid Shahid Signed-off-by: Junaid Shahid --- arch/x86/mm/asi.c | 14 ++++++++++++++ include/linux/memcontrol.h | 3 +++ include/linux/mm_types.h | 17 +++++++++++++++++ mm/memcontrol.c | 30 ++++++++++++++++++++++++++++++ 4 files changed, 64 insertions(+) diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index 71348399baf1..ca50a32ecd7e 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -2,6 +2,7 @@ #include #include +#include #include #include @@ -322,7 +323,20 @@ EXPORT_SYMBOL_GPL(asi_exit); void asi_init_mm_state(struct mm_struct *mm) { + struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm); + memset(mm->asi, 0, sizeof(mm->asi)); + mm->asi_enabled = false; + + /* + * TODO: In addition to a cgroup flag, we may also want a per-process + * flag.
+ */ + if (memcg) { + mm->asi_enabled = boot_cpu_has(X86_FEATURE_ASI) && + memcg->use_asi; + css_put(&memcg->css); + } } static bool is_page_within_range(size_t addr, size_t page_size, diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0c5c403f4be6..a883cb458b06 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -259,6 +259,9 @@ struct mem_cgroup { */ bool oom_group; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + bool use_asi; +#endif /* protected by memcg_oom_lock */ bool oom_lock; int under_oom; diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 5b8028fcfe67..8624d2783661 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -607,6 +607,14 @@ struct mm_struct { * new_owner->alloc_lock is held */ struct task_struct __rcu *owner; + +#endif +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + /* Is ASI enabled for this mm? ASI requires allocating extra + * resources, such as ASI page tables. To prevent allocationg + * these resources for every mm in the system, we expect that + * only VM mm's will have this flag set. */ + bool asi_enabled; #endif struct user_namespace *user_ns; @@ -665,6 +673,15 @@ struct mm_struct { extern struct mm_struct init_mm; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +static inline bool mm_asi_enabled(struct mm_struct *mm) +{ + return mm->asi_enabled; +} +#else +static inline bool mm_asi_enabled(struct mm_struct *mm) { return false; } +#endif + /* Pointer magic because the dynamic array size confuses some compilers. */ static inline void mm_init_cpumask(struct mm_struct *mm) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2ed5f2a0879d..a66d6b222ecf 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3539,6 +3539,29 @@ static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css, return -EINVAL; } +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + +static u64 mem_cgroup_asi_read(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + return mem_cgroup_from_css(css)->use_asi; +} + +static int mem_cgroup_asi_write(struct cgroup_subsys_state *css, + struct cftype *cft, u64 val) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(css); + + if (val == 1 || val == 0) + memcg->use_asi = val; + else + return -EINVAL; + + return 0; +} + +#endif + static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) { unsigned long val; @@ -4888,6 +4911,13 @@ static struct cftype mem_cgroup_legacy_files[] = { .write_u64 = mem_cgroup_hierarchy_write, .read_u64 = mem_cgroup_hierarchy_read, }, +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + { + .name = "use_asi", + .write_u64 = mem_cgroup_asi_write, + .read_u64 = mem_cgroup_asi_read, + }, +#endif { .name = "cgroup.event_control", /* XXX: for compat */ .write = memcg_write_event_control, From patchwork Wed Feb 23 05:21:50 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756423 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 80C94C433EF for ; Wed, 23 Feb 2022 05:25:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238190AbiBWFZy (ORCPT ); Wed, 23 Feb 2022 00:25:54 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57788 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238198AbiBWFZe 
Date: Tue, 22 Feb 2022 21:21:50 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-15-junaids@google.com> References: <20220223052223.1202152-1-junaids@google.com> Subject: [RFC PATCH 14/47] mm: asi: Disable ASI API when ASI is not enabled for a process From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org If ASI is not enabled for a process, then asi_init() will return a NULL ASI pointer as output, though it will return a 0 error code. All other ASI API functions will return without an error when they get a NULL ASI pointer.
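For illustration, a typical caller would then follow roughly this pattern (a sketch only, not part of the patch; asi_index stands in for an index previously returned by asi_register_class()):

	struct asi *asi;
	int err;

	err = asi_init(mm, asi_index, &asi);
	if (err)
		return err;

	/*
	 * If ASI is not enabled for this mm, asi is NULL here. Calls such as
	 * asi_enter(asi), asi_map(asi, addr, len) and asi_destroy(asi) then
	 * simply return without doing anything, so no separate NULL check is
	 * needed at the call sites.
	 */

	/* ... use the restricted address space ... */

	asi_destroy(asi);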
Signed-off-by: Junaid Shahid --- arch/x86/include/asm/asi.h | 2 +- arch/x86/mm/asi.c | 18 ++++++++++-------- include/asm-generic/asi.h | 7 ++++++- 3 files changed, 17 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index 64c2b4d1dba2..f69e1f2f09a4 100644 --- a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -51,7 +51,7 @@ int asi_register_class(const char *name, uint flags, const struct asi_hooks *ops); void asi_unregister_class(int index); -int asi_init(struct mm_struct *mm, int asi_index); +int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi); void asi_destroy(struct asi *asi); void asi_enter(struct asi *asi); diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index ca50a32ecd7e..58d1c532274a 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -207,11 +207,13 @@ static int __init asi_global_init(void) } subsys_initcall(asi_global_init) -int asi_init(struct mm_struct *mm, int asi_index) +int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi) { struct asi *asi = &mm->asi[asi_index]; - if (!boot_cpu_has(X86_FEATURE_ASI)) + *out_asi = NULL; + + if (!boot_cpu_has(X86_FEATURE_ASI) || !mm->asi_enabled) return 0; /* Index 0 is reserved for special purposes. */ @@ -238,13 +240,15 @@ int asi_init(struct mm_struct *mm, int asi_index) set_pgd(asi->pgd + i, asi_global_nonsensitive_pgd[i]); } + *out_asi = asi; + return 0; } EXPORT_SYMBOL_GPL(asi_init); void asi_destroy(struct asi *asi) { - if (!boot_cpu_has(X86_FEATURE_ASI)) + if (!boot_cpu_has(X86_FEATURE_ASI) || !asi) return; asi_free_pgd(asi); @@ -278,11 +282,9 @@ void __asi_enter(void) void asi_enter(struct asi *asi) { - if (!static_cpu_has(X86_FEATURE_ASI)) + if (!static_cpu_has(X86_FEATURE_ASI) || !asi) return; - VM_WARN_ON_ONCE(!asi); - this_cpu_write(asi_cpu_state.target_asi, asi); barrier(); @@ -423,7 +425,7 @@ int asi_map_gfp(struct asi *asi, void *addr, size_t len, gfp_t gfp_flags) size_t end = start + len; size_t page_size; - if (!static_cpu_has(X86_FEATURE_ASI)) + if (!static_cpu_has(X86_FEATURE_ASI) || !asi) return 0; VM_BUG_ON(start & ~PAGE_MASK); @@ -514,7 +516,7 @@ void asi_unmap(struct asi *asi, void *addr, size_t len, bool flush_tlb) size_t end = start + len; pgtbl_mod_mask mask = 0; - if (!static_cpu_has(X86_FEATURE_ASI) || !len) + if (!static_cpu_has(X86_FEATURE_ASI) || !asi || !len) return; VM_BUG_ON(start & ~PAGE_MASK); diff --git a/include/asm-generic/asi.h b/include/asm-generic/asi.h index f918cd052722..51c9c4a488e8 100644 --- a/include/asm-generic/asi.h +++ b/include/asm-generic/asi.h @@ -33,7 +33,12 @@ static inline void asi_unregister_class(int asi_index) { } static inline void asi_init_mm_state(struct mm_struct *mm) { } -static inline int asi_init(struct mm_struct *mm, int asi_index) { return 0; } +static inline +int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi) +{ + *out_asi = NULL; + return 0; +} static inline void asi_destroy(struct asi *asi) { } From patchwork Wed Feb 23 05:21:51 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756422 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8886DC433EF for ; Wed, 23 Feb 2022 05:25:27 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id 
Date: Tue, 22 Feb 2022 21:21:51 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-16-junaids@google.com> References: <20220223052223.1202152-1-junaids@google.com> Subject: [RFC PATCH 15/47] kvm: asi: Restricted address space for VM execution From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org An ASI restricted address space is added for KVM. It is currently only enabled for Intel CPUs. The ASI hooks have been setup to do an L1D cache flush and MDS clear when entering the restricted address space. The hooks are also meant to stun and unstun the sibling hyperthread when exiting and entering the restricted address space.
Internally, we do have a full stunning implementation available, but it hasn't yet been determined whether it is fully compatible with the upstream core scheduling implementation, so it is not included in this patch series and instead this patch just includes corresponding stub functions to demonstrate where the stun/unstun would happen. Signed-off-by: Junaid Shahid --- arch/x86/include/asm/kvm_host.h | 2 + arch/x86/kvm/vmx/vmx.c | 41 ++++++++++++----- arch/x86/kvm/x86.c | 81 ++++++++++++++++++++++++++++++++- include/linux/kvm_host.h | 2 + 4 files changed, 113 insertions(+), 13 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 555f4de47ef2..98cbd6447e3e 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1494,6 +1494,8 @@ struct kvm_x86_ops { int (*complete_emulated_msr)(struct kvm_vcpu *vcpu, int err); void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector); + + void (*flush_sensitive_cpu_state)(struct kvm_vcpu *vcpu); }; struct kvm_x86_nested_ops { diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 0dbf94eb954f..e0178b57be75 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -47,6 +47,7 @@ #include #include #include +#include #include "capabilities.h" #include "cpuid.h" @@ -300,7 +301,7 @@ static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf) else static_branch_disable(&vmx_l1d_should_flush); - if (l1tf == VMENTER_L1D_FLUSH_COND) + if (l1tf == VMENTER_L1D_FLUSH_COND && !boot_cpu_has(X86_FEATURE_ASI)) static_branch_enable(&vmx_l1d_flush_cond); else static_branch_disable(&vmx_l1d_flush_cond); @@ -6079,6 +6080,8 @@ static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu) if (static_branch_likely(&vmx_l1d_flush_cond)) { bool flush_l1d; + VM_BUG_ON(vcpu->kvm->asi); + /* * Clear the per-vcpu flush bit, it gets set again * either from vcpu_run() or from one of the unsafe @@ -6590,16 +6593,31 @@ static fastpath_t vmx_exit_handlers_fastpath(struct kvm_vcpu *vcpu) } } -static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu, - struct vcpu_vmx *vmx) +static void vmx_flush_sensitive_cpu_state(struct kvm_vcpu *vcpu) { - kvm_guest_enter_irqoff(); - /* L1D Flush includes CPU buffer clear to mitigate MDS */ if (static_branch_unlikely(&vmx_l1d_should_flush)) vmx_l1d_flush(vcpu); else if (static_branch_unlikely(&mds_user_clear)) mds_clear_cpu_buffers(); +} + +static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu, + struct vcpu_vmx *vmx) +{ + unsigned long cr3; + + kvm_guest_enter_irqoff(); + + vmx_flush_sensitive_cpu_state(vcpu); + + asi_enter(vcpu->kvm->asi); + + cr3 = __get_current_cr3_fast(); + if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) { + vmcs_writel(HOST_CR3, cr3); + vmx->loaded_vmcs->host_state.cr3 = cr3; + } if (vcpu->arch.cr2 != native_read_cr2()) native_write_cr2(vcpu->arch.cr2); @@ -6609,13 +6627,16 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu, vcpu->arch.cr2 = native_read_cr2(); + VM_WARN_ON_ONCE(vcpu->kvm->asi && !is_asi_active()); + asi_set_target_unrestricted(); + kvm_guest_exit_irqoff(); } static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); - unsigned long cr3, cr4; + unsigned long cr4; /* Record the guest's net vcpu time for enforced NMI injections. 
*/ if (unlikely(!enable_vnmi && @@ -6657,12 +6678,6 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu) if (kvm_register_is_dirty(vcpu, VCPU_REGS_RIP)) vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]); - cr3 = __get_current_cr3_fast(); - if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) { - vmcs_writel(HOST_CR3, cr3); - vmx->loaded_vmcs->host_state.cr3 = cr3; - } - cr4 = cr4_read_shadow(); if (unlikely(cr4 != vmx->loaded_vmcs->host_state.cr4)) { vmcs_writel(HOST_CR4, cr4); @@ -7691,6 +7706,8 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = { .complete_emulated_msr = kvm_complete_insn_gp, .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector, + + .flush_sensitive_cpu_state = vmx_flush_sensitive_cpu_state, }; static __init void vmx_setup_user_return_msrs(void) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index e50e97ac4408..dd07f677d084 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -81,6 +81,7 @@ #include #include #include +#include #define CREATE_TRACE_POINTS #include "trace.h" @@ -297,6 +298,8 @@ EXPORT_SYMBOL_GPL(supported_xcr0); static struct kmem_cache *x86_emulator_cache; +static int __read_mostly kvm_asi_index; + /* * When called, it means the previous get/set msr reached an invalid msr. * Return true if we want to ignore/silent this failed msr access. @@ -8620,6 +8623,50 @@ static struct notifier_block pvclock_gtod_notifier = { }; #endif +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + +/* + * We have an HT-stunning implementation available internally, + * but it is yet to be determined if it is fully compatible with the + * upstream core scheduling implementation. So leaving it out for now + * and just leaving these stubs here. + */ +static void stun_sibling(void) { } +static void unstun_sibling(void) { } + +/* + * This function must be fully re-entrant and idempotent. + * Though the idempotency requirement could potentially be relaxed for stuff + * like stats where complete accuracy is not needed. + */ +static void kvm_pre_asi_exit(void) +{ + stun_sibling(); +} + +/* + * This function must be fully re-entrant and idempotent. + * Though the idempotency requirement could potentially be relaxed for stuff + * like stats where complete accuracy is not needed. 
+ */ +static void kvm_post_asi_enter(void) +{ + struct kvm_vcpu *vcpu = raw_cpu_read(*kvm_get_running_vcpus()); + + kvm_x86_ops.flush_sensitive_cpu_state(vcpu); + + unstun_sibling(); +} + +#endif + +static const struct asi_hooks kvm_asi_hooks = { +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + .pre_asi_exit = kvm_pre_asi_exit, + .post_asi_enter = kvm_post_asi_enter +#endif +}; + int kvm_arch_init(void *opaque) { struct kvm_x86_init_ops *ops = opaque; @@ -8674,6 +8721,15 @@ int kvm_arch_init(void *opaque) if (r) goto out_free_percpu; + if (ops->runtime_ops->flush_sensitive_cpu_state) { + r = asi_register_class("KVM", ASI_MAP_STANDARD_NONSENSITIVE, + &kvm_asi_hooks); + if (r < 0) + goto out_mmu_exit; + + kvm_asi_index = r; + } + kvm_timer_init(); perf_register_guest_info_callbacks(&kvm_guest_cbs); @@ -8694,6 +8750,8 @@ int kvm_arch_init(void *opaque) return 0; +out_mmu_exit: + kvm_mmu_module_exit(); out_free_percpu: free_percpu(user_return_msrs); out_free_x86_emulator_cache: @@ -8720,6 +8778,11 @@ void kvm_arch_exit(void) irq_work_sync(&pvclock_irq_work); cancel_work_sync(&pvclock_gtod_work); #endif + if (kvm_asi_index > 0) { + asi_unregister_class(kvm_asi_index); + kvm_asi_index = 0; + } + kvm_x86_ops.hardware_enable = NULL; kvm_mmu_module_exit(); free_percpu(user_return_msrs); @@ -11391,11 +11454,26 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn); kvm_apicv_init(kvm); + + if (kvm_asi_index > 0) { + ret = asi_init(kvm->mm, kvm_asi_index, &kvm->asi); + if (ret) + goto error; + } + kvm_hv_init_vm(kvm); kvm_mmu_init_vm(kvm); kvm_xen_init_vm(kvm); - return static_call(kvm_x86_vm_init)(kvm); + ret = static_call(kvm_x86_vm_init)(kvm); + if (ret) + goto error; + + return 0; +error: + kvm_page_track_cleanup(kvm); + asi_destroy(kvm->asi); + return ret; } int kvm_arch_post_init_vm(struct kvm *kvm) @@ -11549,6 +11627,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm) kvm_page_track_cleanup(kvm); kvm_xen_destroy_vm(kvm); kvm_hv_destroy_vm(kvm); + asi_destroy(kvm->asi); } static void memslot_rmap_free(struct kvm_memory_slot *slot) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index c310648cc8f1..9dd63ed21f75 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -38,6 +38,7 @@ #include #include +#include #ifndef KVM_MAX_VCPU_IDS #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS @@ -551,6 +552,7 @@ struct kvm { */ struct mutex slots_arch_lock; struct mm_struct *mm; /* userspace tied to this vm */ + struct asi *asi; struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM]; struct kvm_vcpu *vcpus[KVM_MAX_VCPUS]; From patchwork Wed Feb 23 05:21:52 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756421 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D33E3C433F5 for ; Wed, 23 Feb 2022 05:25:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237620AbiBWFZt (ORCPT ); Wed, 23 Feb 2022 00:25:49 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56888 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238256AbiBWFZg (ORCPT ); Wed, 23 Feb 2022 00:25:36 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com 
Date: Tue, 22 Feb 2022 21:21:52 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-17-junaids@google.com> References: <20220223052223.1202152-1-junaids@google.com> Subject: [RFC PATCH 16/47] mm: asi: Support for mapping non-sensitive pcpu chunks From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org This adds support for mapping and unmapping dynamic percpu chunks as globally non-sensitive. A later patch will modify the percpu allocator to use this for dynamically allocating non-sensitive percpu memory.
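As a rough usage sketch (an assumption about the eventual interface, since the __GFP_GLOBAL_NONSENSITIVE plumbing into the percpu allocator only arrives with that later patch), a caller could then do something like:

	u64 __percpu *counters;

	/* Backing pages of the pcpu chunk get mapped via asi_map(ASI_GLOBAL_NONSENSITIVE, ...). */
	counters = __alloc_percpu_gfp(sizeof(u64), __alignof__(u64),
				      GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
	if (!counters)
		return -ENOMEM;

	/* ... accessible while running in a restricted address space ... */

	free_percpu(counters);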
Signed-off-by: Junaid Shahid --- include/linux/vmalloc.h | 4 ++-- mm/percpu-vm.c | 51 +++++++++++++++++++++++++++++++++-------- mm/vmalloc.c | 17 ++++++++++---- security/Kconfig | 2 +- 4 files changed, 58 insertions(+), 16 deletions(-) diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index c7c66decda3e..5f85690f27b6 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -260,14 +260,14 @@ extern __init void vm_area_register_early(struct vm_struct *vm, size_t align); # ifdef CONFIG_MMU struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets, const size_t *sizes, int nr_vms, - size_t align); + size_t align, ulong flags); void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms); # else static inline struct vm_struct ** pcpu_get_vm_areas(const unsigned long *offsets, const size_t *sizes, int nr_vms, - size_t align) + size_t align, ulong flags) { return NULL; } diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c index 2054c9213c43..5579a96ad782 100644 --- a/mm/percpu-vm.c +++ b/mm/percpu-vm.c @@ -153,8 +153,12 @@ static void __pcpu_unmap_pages(unsigned long addr, int nr_pages) static void pcpu_unmap_pages(struct pcpu_chunk *chunk, struct page **pages, int page_start, int page_end) { + struct vm_struct **vms = (struct vm_struct **)chunk->data; unsigned int cpu; int i; + ulong addr, nr_pages; + + nr_pages = page_end - page_start; for_each_possible_cpu(cpu) { for (i = page_start; i < page_end; i++) { @@ -164,8 +168,14 @@ static void pcpu_unmap_pages(struct pcpu_chunk *chunk, WARN_ON(!page); pages[pcpu_page_idx(cpu, i)] = page; } - __pcpu_unmap_pages(pcpu_chunk_addr(chunk, cpu, page_start), - page_end - page_start); + addr = pcpu_chunk_addr(chunk, cpu, page_start); + + /* TODO: We should batch the TLB flushes */ + if (vms[0]->flags & VM_GLOBAL_NONSENSITIVE) + asi_unmap(ASI_GLOBAL_NONSENSITIVE, (void *)addr, + nr_pages * PAGE_SIZE, true); + + __pcpu_unmap_pages(addr, nr_pages); } } @@ -212,18 +222,30 @@ static int __pcpu_map_pages(unsigned long addr, struct page **pages, * reverse lookup (addr -> chunk). 
*/ static int pcpu_map_pages(struct pcpu_chunk *chunk, - struct page **pages, int page_start, int page_end) + struct page **pages, int page_start, int page_end, + gfp_t gfp) { unsigned int cpu, tcpu; int i, err; + ulong addr, nr_pages; + + nr_pages = page_end - page_start; for_each_possible_cpu(cpu) { - err = __pcpu_map_pages(pcpu_chunk_addr(chunk, cpu, page_start), + addr = pcpu_chunk_addr(chunk, cpu, page_start); + err = __pcpu_map_pages(addr, &pages[pcpu_page_idx(cpu, page_start)], - page_end - page_start); + nr_pages); if (err < 0) goto err; + if (gfp & __GFP_GLOBAL_NONSENSITIVE) { + err = asi_map(ASI_GLOBAL_NONSENSITIVE, (void *)addr, + nr_pages * PAGE_SIZE); + if (err) + goto err; + } + for (i = page_start; i < page_end; i++) pcpu_set_page_chunk(pages[pcpu_page_idx(cpu, i)], chunk); @@ -231,10 +253,15 @@ static int pcpu_map_pages(struct pcpu_chunk *chunk, return 0; err: for_each_possible_cpu(tcpu) { + addr = pcpu_chunk_addr(chunk, tcpu, page_start); + + if (gfp & __GFP_GLOBAL_NONSENSITIVE) + asi_unmap(ASI_GLOBAL_NONSENSITIVE, (void *)addr, + nr_pages * PAGE_SIZE, false); + + __pcpu_unmap_pages(addr, nr_pages); if (tcpu == cpu) break; - __pcpu_unmap_pages(pcpu_chunk_addr(chunk, tcpu, page_start), - page_end - page_start); } pcpu_post_unmap_tlb_flush(chunk, page_start, page_end); return err; @@ -285,7 +312,7 @@ static int pcpu_populate_chunk(struct pcpu_chunk *chunk, if (pcpu_alloc_pages(chunk, pages, page_start, page_end, gfp)) return -ENOMEM; - if (pcpu_map_pages(chunk, pages, page_start, page_end)) { + if (pcpu_map_pages(chunk, pages, page_start, page_end, gfp)) { pcpu_free_pages(chunk, pages, page_start, page_end); return -ENOMEM; } @@ -334,13 +361,19 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp) { struct pcpu_chunk *chunk; struct vm_struct **vms; + ulong vm_flags = 0; + + if (static_asi_enabled() && (gfp & __GFP_GLOBAL_NONSENSITIVE)) + vm_flags = VM_GLOBAL_NONSENSITIVE; + + gfp &= ~__GFP_GLOBAL_NONSENSITIVE; chunk = pcpu_alloc_chunk(gfp); if (!chunk) return NULL; vms = pcpu_get_vm_areas(pcpu_group_offsets, pcpu_group_sizes, - pcpu_nr_groups, pcpu_atom_size); + pcpu_nr_groups, pcpu_atom_size, vm_flags); if (!vms) { pcpu_free_chunk(chunk); return NULL; diff --git a/mm/vmalloc.c b/mm/vmalloc.c index ba588a37ee75..f13bfe7e896b 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -3664,10 +3664,10 @@ pvm_determine_end_from_reverse(struct vmap_area **va, unsigned long align) */ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets, const size_t *sizes, int nr_vms, - size_t align) + size_t align, ulong flags) { - const unsigned long vmalloc_start = ALIGN(VMALLOC_START, align); - const unsigned long vmalloc_end = VMALLOC_END & ~(align - 1); + unsigned long vmalloc_start = VMALLOC_START; + unsigned long vmalloc_end = VMALLOC_END; struct vmap_area **vas, *va; struct vm_struct **vms; int area, area2, last_area, term_area; @@ -3677,6 +3677,15 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets, /* verify parameters and allocate data structures */ BUG_ON(offset_in_page(align) || !is_power_of_2(align)); + + if (static_asi_enabled() && (flags & VM_GLOBAL_NONSENSITIVE)) { + vmalloc_start = VMALLOC_GLOBAL_NONSENSITIVE_START; + vmalloc_end = VMALLOC_GLOBAL_NONSENSITIVE_END; + } + + vmalloc_start = ALIGN(vmalloc_start, align); + vmalloc_end = vmalloc_end & ~(align - 1); + for (last_area = 0, area = 0; area < nr_vms; area++) { start = offsets[area]; end = start + sizes[area]; @@ -3815,7 +3824,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets, 
for (area = 0; area < nr_vms; area++) { insert_vmap_area(vas[area], &vmap_area_root, &vmap_area_list); - setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC, + setup_vmalloc_vm_locked(vms[area], vas[area], flags | VM_ALLOC, pcpu_get_vm_areas); } spin_unlock(&vmap_area_lock); diff --git a/security/Kconfig b/security/Kconfig index 0a3e49d6a331..e89c2658e6cf 100644 --- a/security/Kconfig +++ b/security/Kconfig @@ -68,7 +68,7 @@ config PAGE_TABLE_ISOLATION config ADDRESS_SPACE_ISOLATION bool "Allow code to run with a reduced kernel address space" default n - depends on X86_64 && !UML && SLAB + depends on X86_64 && !UML && SLAB && !NEED_PER_CPU_KM depends on !PARAVIRT help This feature provides the ability to run some kernel code From patchwork Wed Feb 23 05:21:53 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756420 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 45C94C433F5 for ; Wed, 23 Feb 2022 05:25:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238263AbiBWFZq (ORCPT ); Wed, 23 Feb 2022 00:25:46 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56740 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238271AbiBWFZh (ORCPT ); Wed, 23 Feb 2022 00:25:37 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 40DE46C1F1 for ; Tue, 22 Feb 2022 21:24:37 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-2d306e372e5so163255917b3.5 for ; Tue, 22 Feb 2022 21:24:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=orMBAJTu5bZfo+ZjtAIW/XRZfJpT6yrC1YB1ci4mdhQ=; b=i488SXIT1e9g9PD4iGqqQ5k0m2F8EhUxsfxeEhnLAG3wqUbodJtbO+eBa1paYhzivd f6qEUMrlJuQyIim27Unr9cJtD2qOlwrJ3Vb92Q7HTAPrjZK/4FZKPxvkHiWfCGgfeGks wPJV2pOqdvyTeI1kLDQCnafd3hNR4RZrBUmWr4GBRfp9dxcFXFHOnNlY5ByTNaX9Ovn7 medb8pTA3zAjMiK0tjiHROjMfziSZ3ZN7ikLLMgh9Do8+8uTWIEbihQ6o2v0V8CMkelT vayq0rCzDDSabPvpcm7/0+wlNG5uCJLTrXlA3oAlDMciITRnUAJzKr5xLZ2ttczqv0Td DGUA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=orMBAJTu5bZfo+ZjtAIW/XRZfJpT6yrC1YB1ci4mdhQ=; b=Mu1ygqrRKoKTxdZF8G0rzpXav1Lvq90+6I1YlsFnv7wJAQ1CYgYNXvbINSwxQ0O0V2 zQWRP24u0wVxtZLn3/Sc8GbIFzobQkcckseXFosxSFXIY/TaGtZztIIrJRssBh+fOUnv UZfpc5oCiD51nLaW8cMi8HHByBhrnZ3AvzjRlX9u2zkGzsWYNDS4oAd76kSkoMgX23b1 vWwD3I5jflHV1rVNCNCzvI3Chi3fqalW7WIsZIC5zOxcuRgzFSusZpHDlgm75ibdgi/o E0tHPCGXLO51et9FfpUW/rPiug3Plo/eE5h9tkDeSPNoeY2rUbHUWMH0gZkSqEK/V3Bh orjA== X-Gm-Message-State: AOAM532puqeVbGs7Yz8BJgMo8zZnbjY84QXs5Z/iS1AMDw2xTq7ztvOB LmLePbZszr7f9VZ0IKEfY2aKc9H7o95k X-Google-Smtp-Source: ABdhPJz6tUJjJr3TTCqxc+szuwfGgWLkVUJz81MU2cNZP+P/jcoAt0UrWZKKXpvg67XLb/VTmlwpxyNA5Syq X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a25:a4e8:0:b0:61e:1eb6:19bd with SMTP id g95-20020a25a4e8000000b0061e1eb619bdmr27268416ybi.168.1645593863676; Tue, 22 Feb 2022 21:24:23 -0800 (PST) Date: Tue, 22 Feb 
2022 21:21:53 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-18-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 17/47] mm: asi: Aliased direct map for local non-sensitive allocations From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org This creates a second copy of the direct map, which mirrors the normal direct map in the regular unrestricted kernel page tables. But in the ASI restricted address spaces, the page tables for this aliased direct map would be local to each process. So this aliased map can be used for locally non-sensitive page allocations. Because of the lack of available kernel virtual address space, we have to reduce the max possible direct map size by half. That should be fine with 5 level page tables but could be an issue with 4 level page tables (as max 32 TB RAM could be supported instead of 64 TB). An alternative vmap-style implementation of an aliased local region is possible without this limitation, but that has some other compromises and would be usable only if we trim down the types of structures marked as local non-sensitive by limiting the designation to only those that really are locally non-sensitive but globally sensitive. That is certainly ideal and likely feasible, and would also allow removal of some other relatively complex infrastructure introduced in later patches. But we are including this implementation here just for demonstration of a fully general mechanism. An altogether different alternative to a separate aliased region is also possible by just partitioning the regular direct map (either statically or dynamically via additional page-block types), which is certainly feasible but would require more effort to implement properly. Signed-off-by: Junaid Shahid --- arch/x86/include/asm/page.h | 19 +++++++- arch/x86/include/asm/page_64.h | 25 +++++++++- arch/x86/include/asm/page_64_types.h | 20 ++++++++ arch/x86/kernel/e820.c | 7 ++- arch/x86/mm/asi.c | 69 +++++++++++++++++++++++++++- arch/x86/mm/kaslr.c | 34 +++++++++++++- arch/x86/mm/mm_internal.h | 2 + arch/x86/mm/physaddr.c | 8 ++++ include/linux/page-flags.h | 3 ++ include/trace/events/mmflags.h | 3 +- security/Kconfig | 1 + 11 files changed, 183 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h index 4d5810c8fab7..7688ba9d3542 100644 --- a/arch/x86/include/asm/page.h +++ b/arch/x86/include/asm/page.h @@ -18,6 +18,7 @@ struct page; +#include #include extern struct range pfn_mapped[]; extern int nr_pfn_mapped; @@ -56,8 +57,24 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr, __phys_addr_symbol(__phys_reloc_hide((unsigned long)(x))) #ifndef __va -#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET)) + +#define ___va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET)) + +#ifndef CONFIG_ADDRESS_SPACE_ISOLATION +#define __va(x) ___va(x) +#else + +DECLARE_STATIC_KEY_FALSE(asi_local_map_initialized); +void *asi_va(unsigned long pa); + +/* + * This might significantly increase the size of the jump table. 
+ * If that turns out to be a problem, we should use a non-static branch. + */ +#define __va(x) (static_branch_likely(&asi_local_map_initialized) \ + ? asi_va((unsigned long)(x)) : ___va(x)) #endif +#endif /* __va */ #define __boot_va(x) __va(x) #define __boot_pa(x) __pa(x) diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h index 4bde0dc66100..2845eca02552 100644 --- a/arch/x86/include/asm/page_64.h +++ b/arch/x86/include/asm/page_64.h @@ -5,6 +5,7 @@ #include #ifndef __ASSEMBLY__ +#include #include /* duplicated to the one in bootmem.h */ @@ -15,12 +16,34 @@ extern unsigned long page_offset_base; extern unsigned long vmalloc_base; extern unsigned long vmemmap_base; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + +extern unsigned long asi_local_map_base; +DECLARE_STATIC_KEY_FALSE(asi_local_map_initialized); + +#else + +/* Should never be used if ASI is not enabled */ +#define asi_local_map_base (*(ulong *)NULL) + +#endif + static inline unsigned long __phys_addr_nodebug(unsigned long x) { unsigned long y = x - __START_KERNEL_map; + unsigned long map_start = PAGE_OFFSET; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + /* + * This might significantly increase the size of the jump table. + * If that turns out to be a problem, we should use a non-static branch. + */ + if (static_branch_likely(&asi_local_map_initialized) && + x > ASI_LOCAL_MAP) + map_start = ASI_LOCAL_MAP; +#endif /* use the carry flag to determine if x was < __START_KERNEL_map */ - x = y + ((x > y) ? phys_base : (__START_KERNEL_map - PAGE_OFFSET)); + x = y + ((x > y) ? phys_base : (__START_KERNEL_map - map_start)); return x; } diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h index e9e2c3ba5923..bd27ebe51a8c 100644 --- a/arch/x86/include/asm/page_64_types.h +++ b/arch/x86/include/asm/page_64_types.h @@ -2,6 +2,8 @@ #ifndef _ASM_X86_PAGE_64_DEFS_H #define _ASM_X86_PAGE_64_DEFS_H +#include + #ifndef __ASSEMBLY__ #include #endif @@ -47,6 +49,24 @@ #define __PAGE_OFFSET __PAGE_OFFSET_BASE_L4 #endif /* CONFIG_DYNAMIC_MEMORY_LAYOUT */ +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + +#define __ASI_LOCAL_MAP_BASE (__PAGE_OFFSET + \ + ALIGN(_BITUL(MAX_PHYSMEM_BITS - 1), PGDIR_SIZE)) + +#ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT +#define ASI_LOCAL_MAP asi_local_map_base +#else +#define ASI_LOCAL_MAP __ASI_LOCAL_MAP_BASE +#endif + +#else /* CONFIG_ADDRESS_SPACE_ISOLATION */ + +/* Should never be used if ASI is not enabled */ +#define ASI_LOCAL_MAP (*(ulong *)NULL) + +#endif + #define __START_KERNEL_map _AC(0xffffffff80000000, UL) /* See Documentation/x86/x86_64/mm.rst for a description of the memory map. 
*/ diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c index bc0657f0deed..e2ea4d6bfbdf 100644 --- a/arch/x86/kernel/e820.c +++ b/arch/x86/kernel/e820.c @@ -880,6 +880,11 @@ static void __init early_panic(char *msg) static int userdef __initdata; +u64 __init set_phys_mem_limit(u64 size) +{ + return e820__range_remove(size, ULLONG_MAX - size, E820_TYPE_RAM, 1); +} + /* The "mem=nopentium" boot option disables 4MB page tables on 32-bit kernels: */ static int __init parse_memopt(char *p) { @@ -905,7 +910,7 @@ static int __init parse_memopt(char *p) if (mem_size == 0) return -EINVAL; - e820__range_remove(mem_size, ULLONG_MAX - mem_size, E820_TYPE_RAM, 1); + set_phys_mem_limit(mem_size); #ifdef CONFIG_MEMORY_HOTPLUG max_mem_size = mem_size; diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index 58d1c532274a..38eaa650bac1 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -22,6 +22,12 @@ EXPORT_PER_CPU_SYMBOL_GPL(asi_cpu_state); __aligned(PAGE_SIZE) pgd_t asi_global_nonsensitive_pgd[PTRS_PER_PGD]; +DEFINE_STATIC_KEY_FALSE(asi_local_map_initialized); +EXPORT_SYMBOL(asi_local_map_initialized); + +unsigned long asi_local_map_base __ro_after_init; +EXPORT_SYMBOL(asi_local_map_base); + int asi_register_class(const char *name, uint flags, const struct asi_hooks *ops) { @@ -181,8 +187,44 @@ static void asi_free_pgd(struct asi *asi) static int __init set_asi_param(char *str) { - if (strcmp(str, "on") == 0) + if (strcmp(str, "on") == 0) { + /* TODO: We should eventually add support for KASAN. */ + if (IS_ENABLED(CONFIG_KASAN)) { + pr_warn("ASI is currently not supported with KASAN"); + return 0; + } + + /* + * We create a second copy of the direct map for the aliased + * ASI Local Map, so we can support only half of the max + * amount of RAM. That should be fine with 5 level page tables + * but could be an issue with 4 level page tables. + * + * An alternative vmap-style implementation of an aliased local + * region is possible without this limitation, but that has + * some other compromises and would be usable only if + * we trim down the types of structures marked as local + * non-sensitive by limiting the designation to only those that + * really are locally non-sensitive but globally sensitive. + * That is certainly ideal and likely feasible, and would also + * allow removal of some other relatively complex infrastructure + * introduced in later patches. But we are including this + * implementation here just for demonstration of a fully general + * mechanism. + * + * An altogether different alternative to a separate aliased + * region is also possible by just partitioning the regular + * direct map (either statically or dynamically via additional + * page-block types), which is certainly feasible but would + * require more effort to implement properly. + */ + if (set_phys_mem_limit(MAXMEM / 2)) + pr_warn("Limiting Memory Size to %llu", MAXMEM / 2); + + asi_local_map_base = __ASI_LOCAL_MAP_BASE; + setup_force_cpu_cap(X86_FEATURE_ASI); + } return 0; } @@ -190,6 +232,8 @@ early_param("asi", set_asi_param); static int __init asi_global_init(void) { + uint i, n; + if (!boot_cpu_has(X86_FEATURE_ASI)) return 0; @@ -203,6 +247,14 @@ static int __init asi_global_init(void) VMALLOC_GLOBAL_NONSENSITIVE_END, "ASI Global Non-sensitive vmalloc"); + /* TODO: We should also handle memory hotplug. 
*/ + n = DIV_ROUND_UP(PFN_PHYS(max_pfn), PGDIR_SIZE); + for (i = 0; i < n; i++) + swapper_pg_dir[pgd_index(ASI_LOCAL_MAP) + i] = + swapper_pg_dir[pgd_index(PAGE_OFFSET) + i]; + + static_branch_enable(&asi_local_map_initialized); + return 0; } subsys_initcall(asi_global_init) @@ -236,7 +288,11 @@ int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi) if (asi->class->flags & ASI_MAP_STANDARD_NONSENSITIVE) { uint i; - for (i = KERNEL_PGD_BOUNDARY; i < PTRS_PER_PGD; i++) + for (i = KERNEL_PGD_BOUNDARY; i < pgd_index(ASI_LOCAL_MAP); i++) + set_pgd(asi->pgd + i, asi_global_nonsensitive_pgd[i]); + + for (i = pgd_index(VMALLOC_GLOBAL_NONSENSITIVE_START); + i < PTRS_PER_PGD; i++) set_pgd(asi->pgd + i, asi_global_nonsensitive_pgd[i]); } @@ -534,3 +590,12 @@ void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len) /* Later patches will do a more optimized flush. */ flush_tlb_kernel_range((ulong)addr, (ulong)addr + len); } + +void *asi_va(unsigned long pa) +{ + struct page *page = pfn_to_page(PHYS_PFN(pa)); + + return (void *)(pa + (PageLocalNonSensitive(page) + ? ASI_LOCAL_MAP : PAGE_OFFSET)); +} +EXPORT_SYMBOL(asi_va); diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c index 557f0fe25dff..2e68ce84767c 100644 --- a/arch/x86/mm/kaslr.c +++ b/arch/x86/mm/kaslr.c @@ -48,6 +48,7 @@ static const unsigned long vaddr_end = CPU_ENTRY_AREA_BASE; static __initdata struct kaslr_memory_region { unsigned long *base; unsigned long size_tb; + unsigned long extra_bytes; } kaslr_regions[] = { { &page_offset_base, 0 }, { &vmalloc_base, 0 }, @@ -57,7 +58,7 @@ static __initdata struct kaslr_memory_region { /* Get size in bytes used by the memory region */ static inline unsigned long get_padding(struct kaslr_memory_region *region) { - return (region->size_tb << TB_SHIFT); + return (region->size_tb << TB_SHIFT) + region->extra_bytes; } /* Initialize base and padding for each memory region randomized with KASLR */ @@ -69,6 +70,8 @@ void __init kernel_randomize_memory(void) struct rnd_state rand_state; unsigned long remain_entropy; unsigned long vmemmap_size; + unsigned int max_physmem_bits = MAX_PHYSMEM_BITS - + !!boot_cpu_has(X86_FEATURE_ASI); vaddr_start = pgtable_l5_enabled() ? __PAGE_OFFSET_BASE_L5 : __PAGE_OFFSET_BASE_L4; vaddr = vaddr_start; @@ -85,7 +88,7 @@ void __init kernel_randomize_memory(void) if (!kaslr_memory_enabled()) return; - kaslr_regions[0].size_tb = 1 << (MAX_PHYSMEM_BITS - TB_SHIFT); + kaslr_regions[0].size_tb = 1 << (max_physmem_bits - TB_SHIFT); kaslr_regions[1].size_tb = VMALLOC_SIZE_TB; /* @@ -100,6 +103,18 @@ void __init kernel_randomize_memory(void) if (memory_tb < kaslr_regions[0].size_tb) kaslr_regions[0].size_tb = memory_tb; + if (boot_cpu_has(X86_FEATURE_ASI)) { + ulong direct_map_size = kaslr_regions[0].size_tb << TB_SHIFT; + + /* Reserve additional space for the ASI Local Map */ + direct_map_size = round_up(direct_map_size, PGDIR_SIZE); + direct_map_size *= 2; + VM_BUG_ON(direct_map_size % (1UL << TB_SHIFT)); + + kaslr_regions[0].size_tb = direct_map_size >> TB_SHIFT; + kaslr_regions[0].extra_bytes = PGDIR_SIZE; + } + /* * Calculate the vmemmap region size in TBs, aligned to a TB * boundary. @@ -136,6 +151,21 @@ void __init kernel_randomize_memory(void) vaddr = round_up(vaddr + 1, PUD_SIZE); remain_entropy -= entropy; } + + /* + * This ensures that the ASI Local Map does not share a PGD entry with + * the regular direct map, and also that the alignment of the two + * regions is the same. 
+ * + * We are relying on the fact that the region following the ASI Local + * Map will be the local non-sensitive portion of the VMALLOC region. + * If that were not the case and the next region was a global one, + * then we would need extra padding after the ASI Local Map to ensure + * that it doesn't share a PGD entry with that global region. + */ + if (cpu_feature_enabled(X86_FEATURE_ASI)) + asi_local_map_base = page_offset_base + PGDIR_SIZE + + ((kaslr_regions[0].size_tb / 2) << TB_SHIFT); } void __meminit init_trampoline_kaslr(void) diff --git a/arch/x86/mm/mm_internal.h b/arch/x86/mm/mm_internal.h index a1e8c523ab08..ace1d0b6d2d9 100644 --- a/arch/x86/mm/mm_internal.h +++ b/arch/x86/mm/mm_internal.h @@ -28,4 +28,6 @@ void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache); extern unsigned long tlb_single_page_flush_ceiling; +u64 set_phys_mem_limit(u64 size); + #endif /* __X86_MM_INTERNAL_H */ diff --git a/arch/x86/mm/physaddr.c b/arch/x86/mm/physaddr.c index fc3f3d3e2ef2..2cd6cee942da 100644 --- a/arch/x86/mm/physaddr.c +++ b/arch/x86/mm/physaddr.c @@ -21,6 +21,9 @@ unsigned long __phys_addr(unsigned long x) x = y + phys_base; VIRTUAL_BUG_ON(y >= KERNEL_IMAGE_SIZE); + } else if (cpu_feature_enabled(X86_FEATURE_ASI) && x > ASI_LOCAL_MAP) { + x -= ASI_LOCAL_MAP; + VIRTUAL_BUG_ON(!phys_addr_valid(x)); } else { x = y + (__START_KERNEL_map - PAGE_OFFSET); @@ -28,6 +31,7 @@ unsigned long __phys_addr(unsigned long x) VIRTUAL_BUG_ON((x > y) || !phys_addr_valid(x)); } + VIRTUAL_BUG_ON(!pfn_valid(x >> PAGE_SHIFT)); return x; } EXPORT_SYMBOL(__phys_addr); @@ -54,6 +58,10 @@ bool __virt_addr_valid(unsigned long x) if (y >= KERNEL_IMAGE_SIZE) return false; + } else if (cpu_feature_enabled(X86_FEATURE_ASI) && x > ASI_LOCAL_MAP) { + x -= ASI_LOCAL_MAP; + if (!phys_addr_valid(x)) + return false; } else { x = y + (__START_KERNEL_map - PAGE_OFFSET); diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index a07434cc679c..e5223a05c41a 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -143,6 +143,7 @@ enum pageflags { #endif #ifdef CONFIG_ADDRESS_SPACE_ISOLATION PG_global_nonsensitive, + PG_local_nonsensitive, #endif __NR_PAGEFLAGS, @@ -547,8 +548,10 @@ PAGEFLAG(Idle, idle, PF_ANY) #ifdef CONFIG_ADDRESS_SPACE_ISOLATION __PAGEFLAG(GlobalNonSensitive, global_nonsensitive, PF_ANY); +__PAGEFLAG(LocalNonSensitive, local_nonsensitive, PF_ANY); #else __PAGEFLAG_FALSE(GlobalNonSensitive, global_nonsensitive); +__PAGEFLAG_FALSE(LocalNonSensitive, local_nonsensitive); #endif #ifdef CONFIG_KASAN_HW_TAGS diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index 73a49197ef54..96e61d838bec 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -129,7 +129,8 @@ IF_HAVE_PG_IDLE(PG_young, "young" ) \ IF_HAVE_PG_IDLE(PG_idle, "idle" ) \ IF_HAVE_PG_ARCH_2(PG_arch_2, "arch_2" ) \ IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison") \ -IF_HAVE_ASI(PG_global_nonsensitive, "global_nonsensitive") +IF_HAVE_ASI(PG_global_nonsensitive, "global_nonsensitive") \ +IF_HAVE_ASI(PG_local_nonsensitive, "local_nonsensitive") #define show_page_flags(flags) \ (flags) ? 
__print_flags(flags, "|", \ diff --git a/security/Kconfig b/security/Kconfig index e89c2658e6cf..070a948b5266 100644 --- a/security/Kconfig +++ b/security/Kconfig @@ -70,6 +70,7 @@ config ADDRESS_SPACE_ISOLATION default n depends on X86_64 && !UML && SLAB && !NEED_PER_CPU_KM depends on !PARAVIRT + depends on !MEMORY_HOTPLUG help This feature provides the ability to run some kernel code with a reduced kernel address space. This can be used to From patchwork Wed Feb 23 05:21:54 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756424 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id CD2F0C4332F for ; Wed, 23 Feb 2022 05:25:35 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238141AbiBWF0A (ORCPT ); Wed, 23 Feb 2022 00:26:00 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58122 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238185AbiBWFZi (ORCPT ); Wed, 23 Feb 2022 00:25:38 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5FD176C921 for ; Tue, 22 Feb 2022 21:24:39 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-2d6bca75aa2so141248237b3.18 for ; Tue, 22 Feb 2022 21:24:39 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=1WKa+KZIcaa+dJcFTsqZOSts6oD//SyJ1kbRYfoMX2c=; b=sZHRLO9kxK5TzWD0X0seDuzrMm3IaK4jRNNmTtrhccA0nwFV9CPHeE86cjQT27AmkI v7sdoVjafaJDxmnspYK1IJDte5ixXIP4NlIpzJcvsi6oStzTdw2xXhTTK6hSh2PE8S4p mF1cpkFTM3aKbNKaOvkN2Z8/pjjAmAg7yV5uvedIm/mcA50u4mXY9mOc7gt5zMPS72Jb jLpByILet1L0UsVaxDROFUGR6gYkZbc/Z5Ipq4VnAgrCI+T/jtIuqYuHmNGwLB/tOVe/ jw6HvYSt2X5Np3Q4sH2T+yBRKnSfg6rhn+2SA7NrFtkru9npVZHpssAQN1BJYkVWWTVb OPZg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=1WKa+KZIcaa+dJcFTsqZOSts6oD//SyJ1kbRYfoMX2c=; b=gkPZsJOMdSUuJMM1JHifFjBV/sR2RbqqgG+qqUx9GhFwKprgqU3/M7T0GwfCxiekup xGWsaJn87HC6ocRDihPakY3Ob3dDHjRYtqLODNtymUiSjs3TF2QimdQ/5BMUybYYP/Ok hgKZiV8ybdk4zfWPr5be3o2kN7mzqKm4zfqvSjEM0a3IlyiuRtjdP/TELBwqhf2X29BC A1mXxViYXtsTAsfGHR/cpzgHYDKEzeEKuGD7ohNjU1SNnmpZdoRgltol7r1Kr548e0iC eNZ2RTP5+qqB62qk1b4VSljf/BtcYRA9m6Kfm5Ijpuein8yH0CSjtXrUnGr3pnTkroiF 4noQ== X-Gm-Message-State: AOAM531XirV/BhVI//JavWf13belizr6g4MWkNfcUWat5pqSdVXtJWFC mMu6vFsH59Oa8Hz1OJqFABs8qq9s+IrN X-Google-Smtp-Source: ABdhPJzkV7QmjkoELH2kQlDpniWdBZ4HcBwkQjVMUj/MiMOJhgvGDpbnFOCshsSuGHfkjD35goEqd0h2ttTQ X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a81:7848:0:b0:2ca:287c:6ce3 with SMTP id t69-20020a817848000000b002ca287c6ce3mr26938064ywc.392.1645593865848; Tue, 22 Feb 2022 21:24:25 -0800 (PST) Date: Tue, 22 Feb 2022 21:21:54 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-19-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 
18/47] mm: asi: Support for pre-ASI-init local non-sensitive allocations From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Local non-sensitive allocations can be made before an actual ASI instance is initialized. To support this, a process-wide pseudo-PGD is created, which contains mappings for all locally non-sensitive allocations. Memory can be mapped into this pseudo-PGD by using ASI_LOCAL_NONSENSITIVE when calling asi_map(). The mappings will be copied to an actual ASI PGD when an ASI instance is initialized in that process, by copying all the PGD entries in the local non-sensitive range from the pseudo-PGD to the ASI PGD. In addition, the page fault handler will copy any new PGD entries that get added after the initialization of the ASI instance. Signed-off-by: Junaid Shahid --- arch/x86/include/asm/asi.h | 6 +++- arch/x86/mm/asi.c | 74 +++++++++++++++++++++++++++++++++- arch/x86/mm/fault.c | 7 ++++ include/asm-generic/asi.h | 12 ++++++- kernel/fork.c | 8 +++-- 5 files changed, 102 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index f69e1f2f09a4..f11010c0334b 100644 --- a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -16,6 +16,7 @@ #define ASI_MAX_NUM (1 << ASI_MAX_NUM_ORDER) #define ASI_GLOBAL_NONSENSITIVE (&init_mm.asi[0]) +#define ASI_LOCAL_NONSENSITIVE (&current->mm->asi[0]) struct asi_state { struct asi *curr_asi; @@ -45,7 +46,8 @@ DECLARE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state); extern pgd_t asi_global_nonsensitive_pgd[]; -void asi_init_mm_state(struct mm_struct *mm); +int asi_init_mm_state(struct mm_struct *mm); +void asi_free_mm_state(struct mm_struct *mm); int asi_register_class(const char *name, uint flags, const struct asi_hooks *ops); @@ -61,6 +63,8 @@ int asi_map_gfp(struct asi *asi, void *addr, size_t len, gfp_t gfp_flags); int asi_map(struct asi *asi, void *addr, size_t len); void asi_unmap(struct asi *asi, void *addr, size_t len, bool flush_tlb); void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len); +void asi_sync_mapping(struct asi *asi, void *addr, size_t len); +void asi_do_lazy_map(struct asi *asi, size_t addr); static inline void asi_init_thread_state(struct thread_struct *thread) { diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index 38eaa650bac1..3ba0971a318d 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -73,6 +73,17 @@ void asi_unregister_class(int index) } EXPORT_SYMBOL_GPL(asi_unregister_class); +static void asi_clone_pgd(pgd_t *dst_table, pgd_t *src_table, size_t addr) +{ + pgd_t *src = pgd_offset_pgd(src_table, addr); + pgd_t *dst = pgd_offset_pgd(dst_table, addr); + + if (!pgd_val(*dst)) + set_pgd(dst, *src); + else + VM_BUG_ON(pgd_val(*dst) != pgd_val(*src)); +} + #ifndef mm_inc_nr_p4ds #define mm_inc_nr_p4ds(mm) do {} while (false) #endif @@ -291,6 +302,11 @@ int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi) for (i = KERNEL_PGD_BOUNDARY; i < pgd_index(ASI_LOCAL_MAP); i++) set_pgd(asi->pgd + i, asi_global_nonsensitive_pgd[i]); + for (i = pgd_index(ASI_LOCAL_MAP); + i <= pgd_index(ASI_LOCAL_MAP + PFN_PHYS(max_possible_pfn)); + i++) + set_pgd(asi->pgd + i, mm->asi[0].pgd[i]); + for (i = 
pgd_index(VMALLOC_GLOBAL_NONSENSITIVE_START); i < PTRS_PER_PGD; i++) set_pgd(asi->pgd + i, asi_global_nonsensitive_pgd[i]); @@ -379,7 +395,7 @@ void asi_exit(void) } EXPORT_SYMBOL_GPL(asi_exit); -void asi_init_mm_state(struct mm_struct *mm) +int asi_init_mm_state(struct mm_struct *mm) { struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm); @@ -395,6 +411,28 @@ void asi_init_mm_state(struct mm_struct *mm) memcg->use_asi; css_put(&memcg->css); } + + if (!mm->asi_enabled) + return 0; + + mm->asi[0].mm = mm; + mm->asi[0].pgd = (pgd_t *)__get_free_page(GFP_PGTABLE_USER); + if (!mm->asi[0].pgd) + return -ENOMEM; + + return 0; +} + +void asi_free_mm_state(struct mm_struct *mm) +{ + if (!boot_cpu_has(X86_FEATURE_ASI) || !mm->asi_enabled) + return; + + asi_free_pgd_range(&mm->asi[0], pgd_index(ASI_LOCAL_MAP), + pgd_index(ASI_LOCAL_MAP + + PFN_PHYS(max_possible_pfn)) + 1); + + free_page((ulong)mm->asi[0].pgd); } static bool is_page_within_range(size_t addr, size_t page_size, @@ -599,3 +637,37 @@ void *asi_va(unsigned long pa) ? ASI_LOCAL_MAP : PAGE_OFFSET)); } EXPORT_SYMBOL(asi_va); + +static bool is_addr_in_local_nonsensitive_range(size_t addr) +{ + return addr >= ASI_LOCAL_MAP && + addr < VMALLOC_GLOBAL_NONSENSITIVE_START; +} + +void asi_do_lazy_map(struct asi *asi, size_t addr) +{ + if (!static_cpu_has(X86_FEATURE_ASI) || !asi) + return; + + if ((asi->class->flags & ASI_MAP_STANDARD_NONSENSITIVE) && + is_addr_in_local_nonsensitive_range(addr)) + asi_clone_pgd(asi->pgd, asi->mm->asi[0].pgd, addr); +} + +/* + * Should be called after asi_map(ASI_LOCAL_NONSENSITIVE,...) for any mapping + * that is required to exist prior to asi_enter() (e.g. thread stacks) + */ +void asi_sync_mapping(struct asi *asi, void *start, size_t len) +{ + size_t addr = (size_t)start; + size_t end = addr + len; + + if (!static_cpu_has(X86_FEATURE_ASI) || !asi) + return; + + if ((asi->class->flags & ASI_MAP_STANDARD_NONSENSITIVE) && + is_addr_in_local_nonsensitive_range(addr)) + for (; addr < end; addr = pgd_addr_end(addr, end)) + asi_clone_pgd(asi->pgd, asi->mm->asi[0].pgd, addr); +} diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 4bfed53e210e..8692eb50f4a5 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1498,6 +1498,12 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault) { unsigned long address = read_cr2(); irqentry_state_t state; + /* + * There is a very small chance that an NMI could cause an asi_exit() + * before this asi_get_current(), but that is ok, we will just do + * the fixup on the next page fault. 
+ */ + struct asi *asi = asi_get_current(); prefetchw(¤t->mm->mmap_lock); @@ -1539,6 +1545,7 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault) instrumentation_begin(); handle_page_fault(regs, error_code, address); + asi_do_lazy_map(asi, address); instrumentation_end(); irqentry_exit(regs, state); diff --git a/include/asm-generic/asi.h b/include/asm-generic/asi.h index 51c9c4a488e8..a1c8ebff70e8 100644 --- a/include/asm-generic/asi.h +++ b/include/asm-generic/asi.h @@ -13,6 +13,7 @@ #define ASI_MAX_NUM 0 #define ASI_GLOBAL_NONSENSITIVE NULL +#define ASI_LOCAL_NONSENSITIVE NULL #define VMALLOC_GLOBAL_NONSENSITIVE_START VMALLOC_START #define VMALLOC_GLOBAL_NONSENSITIVE_END VMALLOC_END @@ -31,7 +32,9 @@ int asi_register_class(const char *name, uint flags, static inline void asi_unregister_class(int asi_index) { } -static inline void asi_init_mm_state(struct mm_struct *mm) { } +static inline int asi_init_mm_state(struct mm_struct *mm) { return 0; } + +static inline void asi_free_mm_state(struct mm_struct *mm) { } static inline int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi) @@ -67,9 +70,16 @@ static inline int asi_map(struct asi *asi, void *addr, size_t len) return 0; } +static inline +void asi_sync_mapping(struct asi *asi, void *addr, size_t len) { } + static inline void asi_unmap(struct asi *asi, void *addr, size_t len, bool flush_tlb) { } + +static inline +void asi_do_lazy_map(struct asi *asi, size_t addr) { } + static inline void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len) { } diff --git a/kernel/fork.c b/kernel/fork.c index 3695a32ee9bd..dd5a86e913ea 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -699,6 +699,7 @@ void __mmdrop(struct mm_struct *mm) mm_free_pgd(mm); destroy_context(mm); mmu_notifier_subscriptions_destroy(mm); + asi_free_mm_state(mm); check_mm(mm); put_user_ns(mm->user_ns); free_mm(mm); @@ -1072,17 +1073,20 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, mm->def_flags = 0; } - asi_init_mm_state(mm); - if (mm_alloc_pgd(mm)) goto fail_nopgd; if (init_new_context(p, mm)) goto fail_nocontext; + if (asi_init_mm_state(mm)) + goto fail_noasi; + mm->user_ns = get_user_ns(user_ns); + return mm; +fail_noasi: fail_nocontext: mm_free_pgd(mm); fail_nopgd: From patchwork Wed Feb 23 05:21:55 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756425 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D7CB9C433FE for ; Wed, 23 Feb 2022 05:25:43 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238270AbiBWF0F (ORCPT ); Wed, 23 Feb 2022 00:26:05 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58132 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237287AbiBWFZi (ORCPT ); Wed, 23 Feb 2022 00:25:38 -0500 Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D78996C938 for ; Tue, 22 Feb 2022 21:24:39 -0800 (PST) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-2d6bca75aa2so141248567b3.18 for ; Tue, 22 Feb 2022 21:24:39 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; 
h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=0QeFswLcX9iEWT+ZD7rOizRcb8mHIR1VarnP+zim9mU=; b=AoOb1g/bwih9VgHUpDsuRj472kE7mvs4Hd5rKJ629FXGqay4OsONzwnNGCn4KSszBQ 7tR0Ehd/9kdVBMulmZqqn0EKhuRWwo5FmNiVUwzvcobRe45iRRTjqmq29bq6BO+aH2aK iVFm5Q63DJUj0wcMPLNxFHfvSAy/DcNXa1BAm45xpqSab7/o/iDvyz2xnot3Gux2i/bz ep5/x08uO4oi4Oa/VRppWnEJLR30fLBFaY+9aHNaO68XzDzENmX6oojUDxShaZpJxCzn AzyiCxL/E+FUObYZMNwPUDfM+bxPv2Rn6QD88aXQUJCPLfukjGOAKhVZbWGz7HVG4WwU EzCA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=0QeFswLcX9iEWT+ZD7rOizRcb8mHIR1VarnP+zim9mU=; b=Y0BPN+TsAuTJQD544v2kWLrgPcdgQRhBi5lQThmRJnhwmX4+t2/UD7NrhihDJ+e1c3 5IYDtKr7K4NRkiqqpv1zKUXVMSWO+/muHnt0P6CqJ0IEhJKzSJ9tdI8xvN9cvvg9t6mY iVlscWZDyBlStn0k5jjfRzbZGksYtYlkatZNvf8KgyzLUOLkUQfC+PINZsmQdMd0oH/F Ut2y2YkJoesu+gF2zFSJKVNLiYSe9TmlvZvx+JzRW/YkBs8BLc4G7XCKkHMWNifqLoH1 IFJutn40oGcw1VH2LTljlrRfudgGayxSaCEuh2aYiOoxChAfc25grCFdCFONBlTD/Mhx hvRg== X-Gm-Message-State: AOAM532mlaBEigmnHfgHHPufT6zNjxzaBDAFBM1yKvE+M26DGaP7o6kz UW5TqHZ+rZ0lJqK5XH4U7JwO1KD9ddYz X-Google-Smtp-Source: ABdhPJx4QwVPvcY0cnBZCAO3p1suuFvqTv6OhknMLXEOsTBctmZQz9YZ5chFPkEZtJ5h60746WaCODirXSsF X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a25:6fc1:0:b0:624:43a0:c16c with SMTP id k184-20020a256fc1000000b0062443a0c16cmr21681170ybc.222.1645593868088; Tue, 22 Feb 2022 21:24:28 -0800 (PST) Date: Tue, 22 Feb 2022 21:21:55 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-20-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 19/47] mm: asi: Support for locally nonsensitive page allocations From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org A new GFP flag, __GFP_LOCAL_NONSENSITIVE, is added to allocate pages that are considered non-sensitive within the context of the current process, but sensitive in the context of other processes. For these allocations, page->asi_mm is set to the current mm during allocation. It must be set to the same value when the page is freed. Though it can potentially be overwritten and used for some other purpose in the meantime, as long as it is restored before freeing. 
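As a rough usage sketch (illustrative only; the helper names below are invented and not part of this patch), a caller in a process with ASI enabled would allocate and free such a page like this:

static struct page *example_alloc_local_page(void)
{
	/*
	 * page->asi_mm is set to current->mm by the page allocator, which
	 * also forces __GFP_ZERO for non-sensitive allocations. If the
	 * current mm does not have ASI enabled, the flag is silently
	 * dropped and an ordinary (sensitive) page is returned.
	 */
	return alloc_page(GFP_KERNEL | __GFP_LOCAL_NONSENSITIVE);
}

static void example_free_local_page(struct page *page)
{
	/*
	 * If page->asi_mm was reused for another purpose while the page
	 * was allocated, it must be restored to the allocating mm before
	 * this call.
	 */
	__free_page(page);
}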
Signed-off-by: Junaid Shahid --- include/linux/gfp.h | 5 +++- include/linux/mm_types.h | 17 ++++++++++-- include/trace/events/mmflags.h | 1 + mm/page_alloc.c | 47 ++++++++++++++++++++++++++++------ tools/perf/builtin-kmem.c | 1 + 5 files changed, 60 insertions(+), 11 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 07a99a463a34..2ab394adbda3 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -62,8 +62,10 @@ struct vm_area_struct; #endif #ifdef CONFIG_ADDRESS_SPACE_ISOLATION #define ___GFP_GLOBAL_NONSENSITIVE 0x4000000u +#define ___GFP_LOCAL_NONSENSITIVE 0x8000000u #else #define ___GFP_GLOBAL_NONSENSITIVE 0 +#define ___GFP_LOCAL_NONSENSITIVE 0 #endif /* If the above are modified, __GFP_BITS_SHIFT may need updating */ @@ -255,9 +257,10 @@ struct vm_area_struct; /* Allocate non-sensitive memory */ #define __GFP_GLOBAL_NONSENSITIVE ((__force gfp_t)___GFP_GLOBAL_NONSENSITIVE) +#define __GFP_LOCAL_NONSENSITIVE ((__force gfp_t)___GFP_LOCAL_NONSENSITIVE) /* Room for N __GFP_FOO bits */ -#define __GFP_BITS_SHIFT 27 +#define __GFP_BITS_SHIFT 28 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) /** diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 8624d2783661..f9702d070975 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -193,8 +193,21 @@ struct page { struct rcu_head rcu_head; #ifdef CONFIG_ADDRESS_SPACE_ISOLATION - /* Links the pages_to_free_async list */ - struct llist_node async_free_node; + struct { + /* Links the pages_to_free_async list */ + struct llist_node async_free_node; + + unsigned long _asi_pad_1; + unsigned long _asi_pad_2; + + /* + * Upon allocation of a locally non-sensitive page, set + * to the allocating mm. Must be set to the same mm when + * the page is freed. May potentially be overwritten in + * the meantime, as long as it is restored before free. 
+ */ + struct mm_struct *asi_mm; + }; #endif }; diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index 96e61d838bec..c00b8a4e1968 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -51,6 +51,7 @@ {(unsigned long)__GFP_KSWAPD_RECLAIM, "__GFP_KSWAPD_RECLAIM"},\ {(unsigned long)__GFP_ZEROTAGS, "__GFP_ZEROTAGS"}, \ {(unsigned long)__GFP_SKIP_KASAN_POISON,"__GFP_SKIP_KASAN_POISON"},\ + {(unsigned long)__GFP_LOCAL_NONSENSITIVE, "__GFP_LOCAL_NONSENSITIVE"},\ {(unsigned long)__GFP_GLOBAL_NONSENSITIVE, "__GFP_GLOBAL_NONSENSITIVE"}\ #define show_gfp_flags(flags) \ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a4048fa1868a..01784bff2a80 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5231,19 +5231,33 @@ early_initcall(asi_page_alloc_init); static int asi_map_alloced_pages(struct page *page, uint order, gfp_t gfp_mask) { uint i; + struct asi *asi; + + VM_BUG_ON((gfp_mask & (__GFP_GLOBAL_NONSENSITIVE | + __GFP_LOCAL_NONSENSITIVE)) == + (__GFP_GLOBAL_NONSENSITIVE | __GFP_LOCAL_NONSENSITIVE)); if (!static_asi_enabled()) return 0; + if (!(gfp_mask & (__GFP_GLOBAL_NONSENSITIVE | + __GFP_LOCAL_NONSENSITIVE))) + return 0; + if (gfp_mask & __GFP_GLOBAL_NONSENSITIVE) { + asi = ASI_GLOBAL_NONSENSITIVE; for (i = 0; i < (1 << order); i++) __SetPageGlobalNonSensitive(page + i); - - return asi_map_gfp(ASI_GLOBAL_NONSENSITIVE, page_to_virt(page), - PAGE_SIZE * (1 << order), gfp_mask); + } else { + asi = ASI_LOCAL_NONSENSITIVE; + for (i = 0; i < (1 << order); i++) { + __SetPageLocalNonSensitive(page + i); + page[i].asi_mm = current->mm; + } } - return 0; + return asi_map_gfp(asi, page_to_virt(page), + PAGE_SIZE * (1 << order), gfp_mask); } static bool asi_unmap_freed_pages(struct page *page, unsigned int order) @@ -5251,18 +5265,28 @@ static bool asi_unmap_freed_pages(struct page *page, unsigned int order) void *va; size_t len; bool async_flush_needed; + struct asi *asi; + + VM_BUG_ON(PageGlobalNonSensitive(page) && PageLocalNonSensitive(page)); if (!static_asi_enabled()) return true; - if (!PageGlobalNonSensitive(page)) + if (PageGlobalNonSensitive(page)) + asi = ASI_GLOBAL_NONSENSITIVE; + else if (PageLocalNonSensitive(page)) + asi = &page->asi_mm->asi[0]; + else return true; + /* Heuristic to check that page->asi_mm is actually an mm_struct */ + VM_BUG_ON(PageLocalNonSensitive(page) && asi->mm != page->asi_mm); + va = page_to_virt(page); len = PAGE_SIZE * (1 << order); async_flush_needed = irqs_disabled() || in_interrupt(); - asi_unmap(ASI_GLOBAL_NONSENSITIVE, va, len, !async_flush_needed); + asi_unmap(asi, va, len, !async_flush_needed); if (!async_flush_needed) return true; @@ -5476,8 +5500,15 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid, return NULL; } - if (static_asi_enabled() && (gfp & __GFP_GLOBAL_NONSENSITIVE)) - gfp |= __GFP_ZERO; + if (static_asi_enabled()) { + if ((gfp & __GFP_LOCAL_NONSENSITIVE) && + !mm_asi_enabled(current->mm)) + gfp &= ~__GFP_LOCAL_NONSENSITIVE; + + if (gfp & (__GFP_GLOBAL_NONSENSITIVE | + __GFP_LOCAL_NONSENSITIVE)) + gfp |= __GFP_ZERO; + } gfp &= gfp_allowed_mask; /* diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c index 5857953cd5c1..a2337fc3404f 100644 --- a/tools/perf/builtin-kmem.c +++ b/tools/perf/builtin-kmem.c @@ -661,6 +661,7 @@ static const struct { { "__GFP_DIRECT_RECLAIM", "DR" }, { "__GFP_KSWAPD_RECLAIM", "KR" }, { "__GFP_GLOBAL_NONSENSITIVE", "GNS" }, + { "__GFP_LOCAL_NONSENSITIVE", "LNS" }, }; static size_t max_gfp_len; From patchwork Wed Feb 
23 05:21:56 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756426 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D80BAC433F5 for ; Wed, 23 Feb 2022 05:25:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238347AbiBWF0U (ORCPT ); Wed, 23 Feb 2022 00:26:20 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57922 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238303AbiBWFZl (ORCPT ); Wed, 23 Feb 2022 00:25:41 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 90E8F6C96F for ; Tue, 22 Feb 2022 21:24:42 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id z15-20020a25bb0f000000b00613388c7d99so26693279ybg.8 for ; Tue, 22 Feb 2022 21:24:42 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=KctzOijb2lPzQRsVweA+t5lqX+AMon2xx7uJ8sEe3V0=; b=o6x+fXjXHT+YmW/ZEagw0M3GAQliRWz6GkGaqA4u+vbUpzhuqjZg6dpP5meVpt0ROf kKIgOplxmZwcy8fpFEZTWPlgRlLYr6BEMXfWuy6EkVszZL1eiXWI04tPu/Bave36tyh0 Pg/rLhImfsoHjz3O2aw8506JPbQ7x/tOlKAxNgPJdEMnSjZ93q2iD4DVERcopQg4JNYp GNKE8XJSp9XzGQ0dIwkM6cgZOM8UkwzVOfAn0JxbE5VvBEVsLLUtWn7Lc0a3H+tJ18Mo xFRsJWo21wVW8cCrGTePkWMS8ReiZ3Pptclz1bJcEt4nXmu0gRlgbulAgKKuDIl4DXV3 55IA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=KctzOijb2lPzQRsVweA+t5lqX+AMon2xx7uJ8sEe3V0=; b=b0dafnJfQzJ7ON/oc58L74POSE18JKy2EqdjuCAinNx0YzREmZUVrdq3Qi+H/kQcr9 dtfN+cJFsgKrWX0slOf/020Vn2m/h2/xkHbZUr8gm6NphM57zVZD6FLxFrHF/hrpNzBm PjM2f4u+IAfp9OSpza2lIlqJGfPS6VYHROupRMQNPNdRH1WvHI+4BuSzEqA17KjGvUnu 5Ohx9+tXij5wORbSxH65FgqUtvTgkq6vyvf+vZ7tcB37pYPuZRXfQFML3eP2ymHJNyN7 7DxofBv5H0ohVIdB/hutN5UM+LUkmenDWld7dEcM5tnq6AsCuQlZLS44exWD6lOsU5nM GPbg== X-Gm-Message-State: AOAM531gPXphyXnMfE6N91Ef27tOg3/6x7Ei914kkIZYMBad+6tLqPAn kmDQHk01OQtSYS/FT45QRTOkP6duztCq X-Google-Smtp-Source: ABdhPJz4jScT7T8fgZ/6SCKV/MEmg4Vf6MpNhLflLFYdfY4h7Ie3CFO275rWWB7jrjHpkdQgBhPjBxZoyl00 X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a25:d90b:0:b0:61d:e8c7:82ff with SMTP id q11-20020a25d90b000000b0061de8c782ffmr26287345ybg.171.1645593870304; Tue, 22 Feb 2022 21:24:30 -0800 (PST) Date: Tue, 22 Feb 2022 21:21:56 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-21-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 20/47] mm: asi: Support for locally non-sensitive vmalloc allocations From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org A new flag, 
VM_LOCAL_NONSENSITIVE is added to designate locally non-sensitive vmalloc/vmap areas. When using the __vmalloc / __vmalloc_node APIs, if the corresponding GFP flag is specified, the VM flag is automatically added. When using the __vmalloc_node_range API, either flag can be specified independently. The VM flag will only map the vmalloc area as non-sensitive, while the GFP flag will only map the underlying direct map area as non-sensitive. When using the __vmalloc_node_range API, instead of VMALLOC_START/END, VMALLOC_LOCAL_NONSENSITIVE_START/END should be used. This is the range that will have different ASI page tables for each process, thereby providing the local mapping. A command line parameter vmalloc_local_nonsensitive_percent is added to specify the approximate division between the per-process and global vmalloc ranges. Note that regular/sensitive vmalloc/vmap allocations are not restricted by this division and can go anywhere in the entire vmalloc range. The division only applies to non-sensitive allocations. Since no attempt is made to balance regular/sensitive allocations across the division, it is possible that one of these ranges gets filled up by regular allocations, leaving no room for the non-sensitive allocations for which that range was designated. But since the vmalloc range is fairly large, hopefully that will not be a problem in practice. If that assumption turns out to be incorrect, we could implement a more sophisticated scheme. Signed-off-by: Junaid Shahid --- arch/x86/include/asm/asi.h | 2 + arch/x86/include/asm/page_64.h | 2 + arch/x86/include/asm/pgtable_64_types.h | 7 ++- arch/x86/mm/asi.c | 57 ++++++++++++++++++ include/asm-generic/asi.h | 5 ++ include/linux/vmalloc.h | 6 ++ mm/vmalloc.c | 78 ++++++++++++++++++++----- 7 files changed, 142 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index f11010c0334b..e3cbf6d8801e 100644 --- a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -46,6 +46,8 @@ DECLARE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state); extern pgd_t asi_global_nonsensitive_pgd[]; +void asi_vmalloc_init(void); + int asi_init_mm_state(struct mm_struct *mm); void asi_free_mm_state(struct mm_struct *mm); diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h index 2845eca02552..b17574349572 100644 --- a/arch/x86/include/asm/page_64.h +++ b/arch/x86/include/asm/page_64.h @@ -18,6 +18,8 @@ extern unsigned long vmemmap_base; #ifdef CONFIG_ADDRESS_SPACE_ISOLATION +extern unsigned long vmalloc_global_nonsensitive_start; +extern unsigned long vmalloc_local_nonsensitive_end; extern unsigned long asi_local_map_base; DECLARE_STATIC_KEY_FALSE(asi_local_map_initialized); diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h index 0fc380ba25b8..06793f7ef1aa 100644 --- a/arch/x86/include/asm/pgtable_64_types.h +++ b/arch/x86/include/asm/pgtable_64_types.h @@ -142,8 +142,13 @@ extern unsigned int ptrs_per_p4d; #define VMALLOC_END (VMALLOC_START + (VMALLOC_SIZE_TB << 40) - 1) #ifdef CONFIG_ADDRESS_SPACE_ISOLATION -#define VMALLOC_GLOBAL_NONSENSITIVE_START VMALLOC_START + +#define VMALLOC_LOCAL_NONSENSITIVE_START VMALLOC_START +#define VMALLOC_LOCAL_NONSENSITIVE_END vmalloc_local_nonsensitive_end + +#define VMALLOC_GLOBAL_NONSENSITIVE_START vmalloc_global_nonsensitive_start #define VMALLOC_GLOBAL_NONSENSITIVE_END VMALLOC_END + #endif #define MODULES_VADDR (__START_KERNEL_map + KERNEL_IMAGE_SIZE) diff --git a/arch/x86/mm/asi.c 
b/arch/x86/mm/asi.c index 3ba0971a318d..91e5ff1224ff 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -3,6 +3,7 @@ #include #include #include +#include #include #include @@ -28,6 +29,17 @@ EXPORT_SYMBOL(asi_local_map_initialized); unsigned long asi_local_map_base __ro_after_init; EXPORT_SYMBOL(asi_local_map_base); +unsigned long vmalloc_global_nonsensitive_start __ro_after_init; +EXPORT_SYMBOL(vmalloc_global_nonsensitive_start); + +unsigned long vmalloc_local_nonsensitive_end __ro_after_init; +EXPORT_SYMBOL(vmalloc_local_nonsensitive_end); + +/* Approximate percent only. Rounded to PGDIR_SIZE boundary. */ +static uint vmalloc_local_nonsensitive_percent __ro_after_init = 50; +core_param(vmalloc_local_nonsensitive_percent, + vmalloc_local_nonsensitive_percent, uint, 0444); + int asi_register_class(const char *name, uint flags, const struct asi_hooks *ops) { @@ -307,6 +319,10 @@ int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi) i++) set_pgd(asi->pgd + i, mm->asi[0].pgd[i]); + for (i = pgd_index(VMALLOC_LOCAL_NONSENSITIVE_START); + i <= pgd_index(VMALLOC_LOCAL_NONSENSITIVE_END); i++) + set_pgd(asi->pgd + i, mm->asi[0].pgd[i]); + for (i = pgd_index(VMALLOC_GLOBAL_NONSENSITIVE_START); i < PTRS_PER_PGD; i++) set_pgd(asi->pgd + i, asi_global_nonsensitive_pgd[i]); @@ -432,6 +448,10 @@ void asi_free_mm_state(struct mm_struct *mm) pgd_index(ASI_LOCAL_MAP + PFN_PHYS(max_possible_pfn)) + 1); + asi_free_pgd_range(&mm->asi[0], + pgd_index(VMALLOC_LOCAL_NONSENSITIVE_START), + pgd_index(VMALLOC_LOCAL_NONSENSITIVE_END) + 1); + free_page((ulong)mm->asi[0].pgd); } @@ -671,3 +691,40 @@ void asi_sync_mapping(struct asi *asi, void *start, size_t len) for (; addr < end; addr = pgd_addr_end(addr, end)) asi_clone_pgd(asi->pgd, asi->mm->asi[0].pgd, addr); } + +void __init asi_vmalloc_init(void) +{ + uint start_index = pgd_index(VMALLOC_START); + uint end_index = pgd_index(VMALLOC_END); + uint global_start_index; + + if (!boot_cpu_has(X86_FEATURE_ASI)) { + vmalloc_global_nonsensitive_start = VMALLOC_START; + vmalloc_local_nonsensitive_end = VMALLOC_END; + return; + } + + if (vmalloc_local_nonsensitive_percent == 0) { + vmalloc_local_nonsensitive_percent = 1; + pr_warn("vmalloc_local_nonsensitive_percent must be non-zero"); + } + + if (vmalloc_local_nonsensitive_percent >= 100) { + vmalloc_local_nonsensitive_percent = 99; + pr_warn("vmalloc_local_nonsensitive_percent must be less than 100"); + } + + global_start_index = start_index + (end_index - start_index) * + vmalloc_local_nonsensitive_percent / 100; + global_start_index = max(global_start_index, start_index + 1); + + vmalloc_global_nonsensitive_start = -(PTRS_PER_PGD - global_start_index) + * PGDIR_SIZE; + vmalloc_local_nonsensitive_end = vmalloc_global_nonsensitive_start - 1; + + pr_debug("vmalloc_global_nonsensitive_start = %llx", + vmalloc_global_nonsensitive_start); + + VM_BUG_ON(vmalloc_local_nonsensitive_end >= VMALLOC_END); + VM_BUG_ON(vmalloc_global_nonsensitive_start <= VMALLOC_START); +} diff --git a/include/asm-generic/asi.h b/include/asm-generic/asi.h index a1c8ebff70e8..7c50d8b64fa4 100644 --- a/include/asm-generic/asi.h +++ b/include/asm-generic/asi.h @@ -18,6 +18,9 @@ #define VMALLOC_GLOBAL_NONSENSITIVE_START VMALLOC_START #define VMALLOC_GLOBAL_NONSENSITIVE_END VMALLOC_END +#define VMALLOC_LOCAL_NONSENSITIVE_START VMALLOC_START +#define VMALLOC_LOCAL_NONSENSITIVE_END VMALLOC_END + #ifndef _ASSEMBLY_ struct asi_hooks {}; @@ -36,6 +39,8 @@ static inline int asi_init_mm_state(struct mm_struct *mm) { return 0; } static 
inline void asi_free_mm_state(struct mm_struct *mm) { } +static inline void asi_vmalloc_init(void) { } + static inline int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi) { diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 5f85690f27b6..2b4eafc21fa5 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -41,8 +41,10 @@ struct notifier_block; /* in notifier.h */ #ifdef CONFIG_ADDRESS_SPACE_ISOLATION #define VM_GLOBAL_NONSENSITIVE 0x00000800 /* Similar to __GFP_GLOBAL_NONSENSITIVE */ +#define VM_LOCAL_NONSENSITIVE 0x00001000 /* Similar to __GFP_LOCAL_NONSENSITIVE */ #else #define VM_GLOBAL_NONSENSITIVE 0 +#define VM_LOCAL_NONSENSITIVE 0 #endif /* bits [20..32] reserved for arch specific ioremap internals */ @@ -67,6 +69,10 @@ struct vm_struct { unsigned int nr_pages; phys_addr_t phys_addr; const void *caller; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + /* Valid if flags contain VM_*_NONSENSITIVE */ + struct asi *asi; +#endif }; struct vmap_area { diff --git a/mm/vmalloc.c b/mm/vmalloc.c index f13bfe7e896b..ea94d8a1e2e9 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2391,18 +2391,25 @@ void __init vmalloc_init(void) */ vmap_init_free_space(); vmap_initialized = true; + + asi_vmalloc_init(); } +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + static int asi_map_vm_area(struct vm_struct *area) { if (!static_asi_enabled()) return 0; if (area->flags & VM_GLOBAL_NONSENSITIVE) - return asi_map(ASI_GLOBAL_NONSENSITIVE, area->addr, - get_vm_area_size(area)); + area->asi = ASI_GLOBAL_NONSENSITIVE; + else if (area->flags & VM_LOCAL_NONSENSITIVE) + area->asi = ASI_LOCAL_NONSENSITIVE; + else + return 0; - return 0; + return asi_map(area->asi, area->addr, get_vm_area_size(area)); } static void asi_unmap_vm_area(struct vm_struct *area) @@ -2415,11 +2422,17 @@ static void asi_unmap_vm_area(struct vm_struct *area) * the case when the existing flush from try_purge_vmap_area_lazy() * and/or vm_unmap_aliases() happens non-lazily. 
*/ - if (area->flags & VM_GLOBAL_NONSENSITIVE) - asi_unmap(ASI_GLOBAL_NONSENSITIVE, area->addr, - get_vm_area_size(area), true); + if (area->flags & (VM_GLOBAL_NONSENSITIVE | VM_LOCAL_NONSENSITIVE)) + asi_unmap(area->asi, area->addr, get_vm_area_size(area), true); } +#else + +static inline int asi_map_vm_area(struct vm_struct *area) { return 0; } +static inline void asi_unmap_vm_area(struct vm_struct *area) { } + +#endif + static inline void setup_vmalloc_vm_locked(struct vm_struct *vm, struct vmap_area *va, unsigned long flags, const void *caller) { @@ -2463,6 +2476,15 @@ static struct vm_struct *__get_vm_area_node(unsigned long size, if (unlikely(!size)) return NULL; + if (static_asi_enabled()) { + VM_BUG_ON((flags & VM_LOCAL_NONSENSITIVE) && + !(start >= VMALLOC_LOCAL_NONSENSITIVE_START && + end <= VMALLOC_LOCAL_NONSENSITIVE_END)); + + VM_BUG_ON((flags & VM_GLOBAL_NONSENSITIVE) && + start < VMALLOC_GLOBAL_NONSENSITIVE_START); + } + if (flags & VM_IOREMAP) align = 1ul << clamp_t(int, get_count_order_long(size), PAGE_SHIFT, IOREMAP_MAX_ORDER); @@ -3073,8 +3095,22 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, if (WARN_ON_ONCE(!size)) return NULL; - if (static_asi_enabled() && (vm_flags & VM_GLOBAL_NONSENSITIVE)) - gfp_mask |= __GFP_ZERO; + if (static_asi_enabled()) { + VM_BUG_ON((vm_flags & (VM_LOCAL_NONSENSITIVE | + VM_GLOBAL_NONSENSITIVE)) == + (VM_LOCAL_NONSENSITIVE | VM_GLOBAL_NONSENSITIVE)); + + if ((vm_flags & VM_LOCAL_NONSENSITIVE) && + !mm_asi_enabled(current->mm)) { + vm_flags &= ~VM_LOCAL_NONSENSITIVE; + + if (end == VMALLOC_LOCAL_NONSENSITIVE_END) + end = VMALLOC_END; + } + + if (vm_flags & (VM_GLOBAL_NONSENSITIVE | VM_LOCAL_NONSENSITIVE)) + gfp_mask |= __GFP_ZERO; + } if ((size >> PAGE_SHIFT) > totalram_pages()) { warn_alloc(gfp_mask, NULL, @@ -3166,11 +3202,19 @@ void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask, int node, const void *caller) { ulong vm_flags = 0; + ulong start = VMALLOC_START, end = VMALLOC_END; - if (static_asi_enabled() && (gfp_mask & __GFP_GLOBAL_NONSENSITIVE)) - vm_flags |= VM_GLOBAL_NONSENSITIVE; + if (static_asi_enabled()) { + if (gfp_mask & __GFP_GLOBAL_NONSENSITIVE) { + vm_flags |= VM_GLOBAL_NONSENSITIVE; + start = VMALLOC_GLOBAL_NONSENSITIVE_START; + } else if (gfp_mask & __GFP_LOCAL_NONSENSITIVE) { + vm_flags |= VM_LOCAL_NONSENSITIVE; + end = VMALLOC_LOCAL_NONSENSITIVE_END; + } + } - return __vmalloc_node_range(size, align, VMALLOC_START, VMALLOC_END, + return __vmalloc_node_range(size, align, start, end, gfp_mask, PAGE_KERNEL, vm_flags, node, caller); } /* @@ -3678,9 +3722,15 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets, /* verify parameters and allocate data structures */ BUG_ON(offset_in_page(align) || !is_power_of_2(align)); - if (static_asi_enabled() && (flags & VM_GLOBAL_NONSENSITIVE)) { - vmalloc_start = VMALLOC_GLOBAL_NONSENSITIVE_START; - vmalloc_end = VMALLOC_GLOBAL_NONSENSITIVE_END; + if (static_asi_enabled()) { + VM_BUG_ON((flags & (VM_LOCAL_NONSENSITIVE | + VM_GLOBAL_NONSENSITIVE)) == + (VM_LOCAL_NONSENSITIVE | VM_GLOBAL_NONSENSITIVE)); + + if (flags & VM_GLOBAL_NONSENSITIVE) + vmalloc_start = VMALLOC_GLOBAL_NONSENSITIVE_START; + else if (flags & VM_LOCAL_NONSENSITIVE) + vmalloc_end = VMALLOC_LOCAL_NONSENSITIVE_END; } vmalloc_start = ALIGN(vmalloc_start, align); From patchwork Wed Feb 23 05:21:57 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756427 
Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 770B4C433F5 for ; Wed, 23 Feb 2022 05:26:00 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238230AbiBWF0Y (ORCPT ); Wed, 23 Feb 2022 00:26:24 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57942 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238317AbiBWFZl (ORCPT ); Wed, 23 Feb 2022 00:25:41 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7E2576D859 for ; Tue, 22 Feb 2022 21:24:46 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id k10-20020a056902070a00b0062469b00335so10829245ybt.14 for ; Tue, 22 Feb 2022 21:24:46 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=wpN/U5eluX6Jnat+pEDX2C0UlbFhsXsgqK2IMjW00B0=; b=eO68TZsayJUnzF9ppfHaPSwA9f3cpJtiM/bQv1NjIx1viksnomhVCZg/OPfesExCQe GPBpKhmjuCayGTZhMCQafKSmuYZ+smeOmFtUbvbkWS5siPQw7nd2sd5jds4KSpKgIMG1 5rAZMh+2rFOS1AHpWlupbprjH1FaaIPoFdMADMVIZrEncJu5IwxbAAD77jq17vk7nHI/ pHx7Yx3y5+Ja+7mdhxYJ7U+IEaib+spU25N1cAyqYIyEgycYL490w9sjOc0zBhbj55fl oXav/h9TUMbHXVfYpZ1MYXyKyDqxgCsBYAWNUi6HD47X/71tlKLWuj7QNFucy0Ul1WpD v86A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=wpN/U5eluX6Jnat+pEDX2C0UlbFhsXsgqK2IMjW00B0=; b=4zC+VmrzZyqsIKJ1O8fTcxlA4CqsOet3iA5RMQnDOgta9RvXohAY6mFVx1VOiN76DW iK9GoIQLQCINvVs3ZeP1hSQVFiVw/tPfjK8RUgqDzcWWSnN9re70vgpEcO/8Vv1Rxtb7 juHIphPCbYIqeRopASgEnG2n0N6PQI8QBBo+XXubpn6fYs1PjHXNdXLS9+svhNztRy3j m2MRtaP3Dc+Mcr662RCmsGF7B5vzD9iJvpEYyLInmyneuEotkl9koyT1GLIEwTaKPxAR o3nGGw/9SEKxembq9QxfZc+wVAopLWvjxvGP2exhrSxlBQOaVXdSl//bpm1ypSgwWYuQ Vfyw== X-Gm-Message-State: AOAM5312E+nw5UhLgIHy9JNkcefQ0ZmQ/yvLjicJsSPU0sCiEd0gM/LA buLNH2rQaVXxHqc/1c5ZsTsvAariB4X9 X-Google-Smtp-Source: ABdhPJy8YBcGGLS1CDa1jQAvWHy6IX+C9Fx9N9t0vFRpgYOq3ptmbXD9mdu9QbdYBQ3ESjXtAc+9DWkF+HCP X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a81:7141:0:b0:2d3:d549:23f8 with SMTP id m62-20020a817141000000b002d3d54923f8mr27573261ywc.87.1645593872454; Tue, 22 Feb 2022 21:24:32 -0800 (PST) Date: Tue, 22 Feb 2022 21:21:57 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-22-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 21/47] mm: asi: Add support for locally non-sensitive VM_USERMAP pages From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org VM_USERMAP pages can be mapped into userspace, which would overwrite the asi_mm field, so we restore that field when freeing these pages. 
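As an illustrative sketch of the kind of caller this targets (the function names here are invented and not part of this patch), a user-mappable, locally non-sensitive buffer would be allocated and later mapped into userspace roughly like this:

static void *example_alloc_usermap_buf(unsigned long size)
{
	/* User-mappable, locally non-sensitive vmalloc area */
	return __vmalloc_node_range(size, PAGE_SIZE,
				    VMALLOC_LOCAL_NONSENSITIVE_START,
				    VMALLOC_LOCAL_NONSENSITIVE_END,
				    GFP_KERNEL | __GFP_LOCAL_NONSENSITIVE,
				    PAGE_KERNEL,
				    VM_USERMAP | VM_LOCAL_NONSENSITIVE,
				    NUMA_NO_NODE,
				    __builtin_return_address(0));
}

static int example_mmap_buf(struct vm_area_struct *vma, void *buf)
{
	/*
	 * Mapping the pages into userspace can clobber page->asi_mm, as
	 * noted above, which is why asi_unmap_vm_area() restores it
	 * before the pages are freed.
	 */
	return remap_vmalloc_range(vma, buf, 0);
}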
Signed-off-by: Junaid Shahid --- mm/vmalloc.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index ea94d8a1e2e9..a89866a926f6 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2424,6 +2424,14 @@ static void asi_unmap_vm_area(struct vm_struct *area) */ if (area->flags & (VM_GLOBAL_NONSENSITIVE | VM_LOCAL_NONSENSITIVE)) asi_unmap(area->asi, area->addr, get_vm_area_size(area), true); + + if (area->flags & VM_USERMAP) { + uint i; + + for (i = 0; i < area->nr_pages; i++) + if (PageLocalNonSensitive(area->pages[i])) + area->pages[i]->asi_mm = area->asi->mm; + } } #else From patchwork Wed Feb 23 05:21:58 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756428 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9F59FC433EF for ; Wed, 23 Feb 2022 05:26:04 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238329AbiBWF02 (ORCPT ); Wed, 23 Feb 2022 00:26:28 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57702 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238325AbiBWFZl (ORCPT ); Wed, 23 Feb 2022 00:25:41 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C3A0D6D87A for ; Tue, 22 Feb 2022 21:24:48 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id b11-20020a5b008b000000b00624ea481d55so2532575ybp.19 for ; Tue, 22 Feb 2022 21:24:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=eHvI/S5CbFiuxrTOsf9OKECf8MKQA40+sVHVsgBWXTI=; b=V0N2AQ61NCzV6BZmqiQxUYs5pGX63pZ6VVRWlptLLGQUiVMxzlkVsCzNfrYnsMXVug KGkm/PYQEu9If/h+TwyTP773A843C1vccvswNpzQzC+CWF0lyH9rv4lYVpwygBHxjjxF KtfQ48TD03/CH3MJc6uQdii7U/orjcd6yROvM49vbkxsBatJT7hHIeSSgYOHhBpNRFtZ ikDCPeVq4k3/1dGPRt9rMDvqd7qp2qvlw0rdUxHJz6cMzz6PpmzwFKFFGSl65Hota3P/ hcaYP68KMD05XrecJbTtx+liHxuoMDRkdxyvRDaTUelX+R7q43SiNNrm4qp+y/niXgLp +m3g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=eHvI/S5CbFiuxrTOsf9OKECf8MKQA40+sVHVsgBWXTI=; b=dHIUEeSOMx5fnegQGm9viOJITsXM9581abAF2WcUl6gXXpWhpmFbUS22vg3tzVXEhG 1x6tlFQwiWQshnNxNBFmRQumZU5Uxy9w9ZZU3FgcL4rF+1WAPEatAXvWpjzAsctH1ikb Md+/RFnKJLuX1QSnvvlVMscI9nHftb5VAhyvCmj9grP1h8+uehgEgNc2OMALmwo0Ft9e 3kqqWWuHqCSibAf+xzX/YqlWFFicglnIYHR1m7yJx8P9mg0Gio4Dt1yYGrmCkomKeyOU iAKGBMqI0WribylqJ1C6Jrigqt4ozN7aD3F7wNVFEbc7WWLZ7DDNrryyZKI6SA7T+MR6 989g== X-Gm-Message-State: AOAM5338lUPABtvbk0mt+IAYKR0WDHeXZNHndKxCg64Go9P2UTLMaOn5 BJByu6V1C35z32iOLIArgYgYsOQn3+rr X-Google-Smtp-Source: ABdhPJylISnqnK8ITQ9nblGksWK/PwAayD9axmVlFF61BCPqFWoHn9afvvtl0MNwA3JR60UblOXBCmfjIzzy X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a81:7951:0:b0:2d6:b7bf:216a with SMTP id u78-20020a817951000000b002d6b7bf216amr24436525ywc.258.1645593874907; Tue, 22 Feb 2022 21:24:34 -0800 (PST) Date: Tue, 22 Feb 2022 21:21:58 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: 
<20220223052223.1202152-23-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 22/47] mm: asi: Added refcounting when initializing an asi From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: Ofir Weisse , kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Ofir Weisse Some KVM tests initialize multiple VMs in a single process. For these cases, we want to support multiple calls to asi_init() before a single asi_destroy() is called. We want the initialization to happen exactly once. If asi_destroy() is called, release the resources only if the counter has reached zero. In our current implementation, ASIs are tied to a specific mm. This may change in a future implementation, in which case the mutex for the refcounting will need to move to struct asi. Signed-off-by: Ofir Weisse --- arch/x86/include/asm/asi.h | 1 + arch/x86/mm/asi.c | 52 +++++++++++++++++++++++++++++++++----- include/linux/mm_types.h | 2 ++ kernel/fork.c | 3 +++ 4 files changed, 51 insertions(+), 7 deletions(-) diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index e3cbf6d8801e..2dc465f78bcc 100644 --- a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -40,6 +40,7 @@ struct asi { pgd_t *pgd; struct asi_class *class; struct mm_struct *mm; + int64_t asi_ref_count; }; DECLARE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state); diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index 91e5ff1224ff..ac35323193a3 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -282,9 +282,25 @@ static int __init asi_global_init(void) } subsys_initcall(asi_global_init) +/* We're assuming we hold mm->asi_init_lock */ +static void __asi_destroy(struct asi *asi) +{ + if (!boot_cpu_has(X86_FEATURE_ASI)) + return; + + /* If refcount is non-zero, it means asi_init() was called multiple + * times. We free the asi pgd only when the last VM is destroyed. */ + if (--(asi->asi_ref_count) > 0) + return; + + asi_free_pgd(asi); + memset(asi, 0, sizeof(struct asi)); +} + int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi) { - struct asi *asi = &mm->asi[asi_index]; + int err = 0; + struct asi *asi = &mm->asi[asi_index]; *out_asi = NULL; @@ -295,6 +311,15 @@ int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi) WARN_ON(asi_index == 0 || asi_index >= ASI_MAX_NUM); WARN_ON(asi->pgd != NULL); + /* Currently, mm and asi structs are conceptually tied together. In + * future implementations an asi object might be unrelated to a specific + * mm. In that future implementation, the mutex will have to be inside + * asi. */ + mutex_lock(&mm->asi_init_lock); + + if (asi->asi_ref_count++ > 0) + goto exit_unlock; /* err is 0 */ + /* * For now, we allocate 2 pages to avoid any potential problems with * KPTI code. 
This won't be needed once KPTI is folded into the ASI @@ -302,8 +327,10 @@ int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi) */ asi->pgd = (pgd_t *)__get_free_pages(GFP_PGTABLE_USER, PGD_ALLOCATION_ORDER); - if (!asi->pgd) - return -ENOMEM; + if (!asi->pgd) { + err = -ENOMEM; + goto exit_unlock; + } asi->class = &asi_class[asi_index]; asi->mm = mm; @@ -328,19 +355,30 @@ int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi) set_pgd(asi->pgd + i, asi_global_nonsensitive_pgd[i]); } - *out_asi = asi; +exit_unlock: + if (err) + __asi_destroy(asi); - return 0; + /* This unlock signals future asi_init() callers that we finished. */ + mutex_unlock(&mm->asi_init_lock); + + if (!err) + *out_asi = asi; + return err; } EXPORT_SYMBOL_GPL(asi_init); void asi_destroy(struct asi *asi) { + struct mm_struct *mm; + if (!boot_cpu_has(X86_FEATURE_ASI) || !asi) return; - asi_free_pgd(asi); - memset(asi, 0, sizeof(struct asi)); + mm = asi->mm; + mutex_lock(&mm->asi_init_lock); + __asi_destroy(asi); + mutex_unlock(&mm->asi_init_lock); } EXPORT_SYMBOL_GPL(asi_destroy); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index f9702d070975..e6980ae31323 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -16,6 +16,7 @@ #include #include #include +#include #include #include @@ -628,6 +629,7 @@ struct mm_struct { * these resources for every mm in the system, we expect that * only VM mm's will have this flag set. */ bool asi_enabled; + struct mutex asi_init_lock; #endif struct user_namespace *user_ns; diff --git a/kernel/fork.c b/kernel/fork.c index dd5a86e913ea..68b3aeab55ac 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1084,6 +1084,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, mm->user_ns = get_user_ns(user_ns); +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + mutex_init(&mm->asi_init_lock); +#endif return mm; fail_noasi: From patchwork Wed Feb 23 05:21:59 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756429 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5421EC433EF for ; Wed, 23 Feb 2022 05:26:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238310AbiBWF0c (ORCPT ); Wed, 23 Feb 2022 00:26:32 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57704 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238332AbiBWFZm (ORCPT ); Wed, 23 Feb 2022 00:25:42 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 723886D95A for ; Tue, 22 Feb 2022 21:24:49 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-2d6b6cf0cafso150372667b3.21 for ; Tue, 22 Feb 2022 21:24:49 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=migmFG7UQDaW8MeCncNCopIOLc7/KCi4nAx8Y23Kqls=; b=ZIf827+EcfBkAYznkNAzIxjepuJOo5jGJUpKJJAWVlbJLp/v2xx9WndG0Zfv0Y30pp QVoiE4x5/isYQS6r529V5FuPTwvJXepA8/Ig/UwtVbwxtSPhO5eBWW/xZu7GdKR2hwRD noJsJ3Ur8bMBMS1eL71R7PXbr8K1C4T4AmgEfBVK5/+rkrUOvlO+o5qBs5UJm8vfEFH8 
iWvYMGLx9ZnbGLPu3ZtgRIXCqWDVhmvwLhi1ORJZdwX1PxFlhSsfJbPaCpw4raGaKYCV 43iNhXkj/THFgjjQb9mrwbCVq0tdVwLBhSI29IV5lijsBJ9zHJayAmn0j12b9dgg/i30 Q+4w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=migmFG7UQDaW8MeCncNCopIOLc7/KCi4nAx8Y23Kqls=; b=MtmvUQzKhCiQXqcZs/hUsPqCHuMddkIclhc7IfIrlEN4uaawNXnPzg0DOJ3BwHzaRI KE3Rn1B43k7JSku9TLo7sXif9v58l2gcjqlc/sMzO5dGuJCtVdtfnrDRSaHPjzsCTVXO wfyNxRAnVGrDg1XUmvWVZ4Hy5vbeDCe80vrl5rBoC/7S7DO+7U+o9GUOK3uvz9wWEq4A keC3xk1wpYy1kEw0lUpugF8Pa/npRp0eqNv5pDRfn4bSnMTg+C3wt1xHPyIWjqiRL9Ak BJuZjK3TXNhRri2zVnkX8HRPxAEhWy79yCwzxLDzSL22rxHONt0TQ+wNZlp1zR4AYHKD 59hQ== X-Gm-Message-State: AOAM531flawy7j1Bm/HliOjuFOsBWia97uyrlTIQLMtprf9OgM4GBqD2 rM7mqbQNBnlbnGXIsKqZe13l5SWFUvZj X-Google-Smtp-Source: ABdhPJzpSBL0VJdokvTlUTI5d1w3wFKyOwiaPENifHATjNeM+qviCG3MD0z5nH5EeDzE1oVe+ERWkd3RD0Jr X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a81:84d5:0:b0:2d1:e85:bf04 with SMTP id u204-20020a8184d5000000b002d10e85bf04mr27926930ywf.465.1645593877093; Tue, 22 Feb 2022 21:24:37 -0800 (PST) Date: Tue, 22 Feb 2022 21:21:59 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-24-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 23/47] mm: asi: Add support for mapping all userspace memory into ASI From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org This adds a new ASI class flag, ASI_MAP_ALL_USERSPACE, which if set, would automatically map all userspace addresses into that ASI address space. This is achieved by lazily cloning the userspace PGD entries during page faults encountered while in that restricted address space. When the userspace PGD entry is cleared (e.g. in munmap()), we go through all restricted address spaces with the ASI_MAP_ALL_USERSPACE flag and clear the corresponding entry in those address spaces as well. 
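As a rough usage sketch (not part of this patch), a hypothetical ASI class opts into this behavior simply by setting the new flag at registration time. The asi_register_class() signature and the asi_hooks structure below are assumptions based on the core ASI API introduced earlier in this series, and the class name is made up for illustration:

/*
 * Hypothetical example: register an ASI class whose restricted address
 * space lazily maps all userspace memory. Only ASI_MAP_ALL_USERSPACE is
 * new in this patch; the registration API is assumed from the core patch.
 */
static const struct asi_hooks example_asi_ops = {
	/* .post_asi_enter / .pre_asi_exit hooks are optional here. */
};

static int example_asi_index;

static int __init example_asi_class_init(void)
{
	int ret = asi_register_class("example",
				     ASI_MAP_STANDARD_NONSENSITIVE |
				     ASI_MAP_ALL_USERSPACE,
				     &example_asi_ops);
	if (ret < 0)
		return ret;

	example_asi_index = ret;
	return 0;
}

Once such a class is registered, page faults taken while running in its restricted address space go through asi_do_lazy_map(), which copies the faulting userspace PGD entry, and the page-table freeing paths in mm/memory.c call asi_clear_user_pgd()/asi_clear_user_p4d() to drop the entry again, as shown in the diff below.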
Signed-off-by: Junaid Shahid --- arch/x86/include/asm/asi.h | 2 + arch/x86/mm/asi.c | 81 ++++++++++++++++++++++++++++++++++++++ include/asm-generic/asi.h | 7 ++++ mm/memory.c | 2 + 4 files changed, 92 insertions(+) diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index 2dc465f78bcc..062ccac07fd9 100644 --- a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -68,6 +68,8 @@ void asi_unmap(struct asi *asi, void *addr, size_t len, bool flush_tlb); void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len); void asi_sync_mapping(struct asi *asi, void *addr, size_t len); void asi_do_lazy_map(struct asi *asi, size_t addr); +void asi_clear_user_pgd(struct mm_struct *mm, size_t addr); +void asi_clear_user_p4d(struct mm_struct *mm, size_t addr); static inline void asi_init_thread_state(struct thread_struct *thread) { diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index ac35323193a3..a3d96be76fa9 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -702,6 +702,41 @@ static bool is_addr_in_local_nonsensitive_range(size_t addr) addr < VMALLOC_GLOBAL_NONSENSITIVE_START; } +static void asi_clone_user_pgd(struct asi *asi, size_t addr) +{ + pgd_t *src = pgd_offset_pgd(asi->mm->pgd, addr); + pgd_t *dst = pgd_offset_pgd(asi->pgd, addr); + pgdval_t old_src, curr_src; + + if (pgd_val(*dst)) + return; + + VM_BUG_ON(!irqs_disabled()); + + /* + * This synchronizes against the PGD entry getting cleared by + * free_pgd_range(). That path has the following steps: + * 1. pgd_clear + * 2. asi_clear_user_pgd + * 3. Remote TLB Flush + * 4. Free page tables + * + * (3) will be blocked for the duration of this function because the + * IPI will remain pending until interrupts are re-enabled. + * + * The following loop ensures that if we read the PGD value before + * (1) and write it after (2), we will re-read the value and write + * the new updated value. + */ + curr_src = pgd_val(*src); + do { + set_pgd(dst, __pgd(curr_src)); + smp_mb(); + old_src = curr_src; + curr_src = pgd_val(*src); + } while (old_src != curr_src); +} + void asi_do_lazy_map(struct asi *asi, size_t addr) { if (!static_cpu_has(X86_FEATURE_ASI) || !asi) @@ -710,6 +745,9 @@ void asi_do_lazy_map(struct asi *asi, size_t addr) if ((asi->class->flags & ASI_MAP_STANDARD_NONSENSITIVE) && is_addr_in_local_nonsensitive_range(addr)) asi_clone_pgd(asi->pgd, asi->mm->asi[0].pgd, addr); + else if ((asi->class->flags & ASI_MAP_ALL_USERSPACE) && + addr < TASK_SIZE_MAX) + asi_clone_user_pgd(asi, addr); } /* @@ -766,3 +804,46 @@ void __init asi_vmalloc_init(void) VM_BUG_ON(vmalloc_local_nonsensitive_end >= VMALLOC_END); VM_BUG_ON(vmalloc_global_nonsensitive_start <= VMALLOC_START); } + +static void __asi_clear_user_pgd(struct mm_struct *mm, size_t addr) +{ + uint i; + + if (!static_cpu_has(X86_FEATURE_ASI) || !mm_asi_enabled(mm)) + return; + + /* + * This function is called right after pgd_clear/p4d_clear. + * We need to be sure that the preceding pXd_clear is visible before + * the ASI pgd clears below. Compare with asi_clone_user_pgd(). + */ + smp_mb__before_atomic(); + + /* + * We need to ensure that the ASI PGD tables do not get freed from + * under us. We can potentially use RCU to avoid that, but since + * this path is probably not going to be too performance sensitive, + * so we just acquire the lock to block asi_destroy(). 
+ */ + mutex_lock(&mm->asi_init_lock); + + for (i = 1; i < ASI_MAX_NUM; i++) + if (mm->asi[i].class && + (mm->asi[i].class->flags & ASI_MAP_ALL_USERSPACE)) + set_pgd(pgd_offset_pgd(mm->asi[i].pgd, addr), + native_make_pgd(0)); + + mutex_unlock(&mm->asi_init_lock); +} + +void asi_clear_user_pgd(struct mm_struct *mm, size_t addr) +{ + if (pgtable_l5_enabled()) + __asi_clear_user_pgd(mm, addr); +} + +void asi_clear_user_p4d(struct mm_struct *mm, size_t addr) +{ + if (!pgtable_l5_enabled()) + __asi_clear_user_pgd(mm, addr); +} diff --git a/include/asm-generic/asi.h b/include/asm-generic/asi.h index 7c50d8b64fa4..8513d0d7865a 100644 --- a/include/asm-generic/asi.h +++ b/include/asm-generic/asi.h @@ -6,6 +6,7 @@ /* ASI class flags */ #define ASI_MAP_STANDARD_NONSENSITIVE 1 +#define ASI_MAP_ALL_USERSPACE 2 #ifndef CONFIG_ADDRESS_SPACE_ISOLATION @@ -85,6 +86,12 @@ void asi_unmap(struct asi *asi, void *addr, size_t len, bool flush_tlb) { } static inline void asi_do_lazy_map(struct asi *asi, size_t addr) { } +static inline +void asi_clear_user_pgd(struct mm_struct *mm, size_t addr) { } + +static inline +void asi_clear_user_p4d(struct mm_struct *mm, size_t addr) { } + static inline void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len) { } diff --git a/mm/memory.c b/mm/memory.c index 8f1de811a1dc..667ece86e051 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -296,6 +296,7 @@ static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d, pud = pud_offset(p4d, start); p4d_clear(p4d); + asi_clear_user_p4d(tlb->mm, start); pud_free_tlb(tlb, pud, start); mm_dec_nr_puds(tlb->mm); } @@ -330,6 +331,7 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd, p4d = p4d_offset(pgd, start); pgd_clear(pgd); + asi_clear_user_pgd(tlb->mm, start); p4d_free_tlb(tlb, p4d, start); } From patchwork Wed Feb 23 05:22:00 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756431 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D383EC433F5 for ; Wed, 23 Feb 2022 05:26:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238319AbiBWF0y (ORCPT ); Wed, 23 Feb 2022 00:26:54 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58104 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238193AbiBWFZx (ORCPT ); Wed, 23 Feb 2022 00:25:53 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 982626E294 for ; Tue, 22 Feb 2022 21:24:52 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id a12-20020a056902056c00b0061dc0f2a94aso26532332ybt.6 for ; Tue, 22 Feb 2022 21:24:52 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=ztkGsqUqkQcfbxlBdMnsFZA5zf9U5+9bkMDZbcQFZD8=; b=n78DK1+xw1pRWsJW1CFG6TQ6Gavlq4ZQ5W8oD9M2VCMnu8PrU3OyRhoo6bkFuFB9Bp w2aijS13h+Op8okZ6CS3EIZUB1IYhktNxmLMznN4orFxlVaThvl2eVvRB8zbVhIFGOvq jpsqzPo5HOiq6C+toVkqrWMJLKA9J7cIFPDVog0m9lqNblUfFa8H0hzg7NsV5GaTgQ4A 5jBI/lIXeOUGWHVKqN95Y8V5NuPIznfs6f5wBqsu6+GDk2tYuPnXsBHJNOaqpE8YL9hS tuNt6ye+jTS2wLjdTDudRIQlCeKYaeA1rt8TrSn8xLY0Ts3RiaaKU5y9wgWJ9NQ/XyT5 wAiQ== 
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=ztkGsqUqkQcfbxlBdMnsFZA5zf9U5+9bkMDZbcQFZD8=; b=FlFZPXef6SZBoD79SbGSqfVVj56JPrn/gs5k8ZQgjofYOjvEKIEhmkZgWNdndoD+LW ReAIJHMri8BzRDGfn/mfzqDHUrplwg0r4Z0CVqo+ISxJbOYXAh9XXwaaPcaAZ1ghjdcr G/NelrEfHLIM7RrRG+T9aaJzeQbWIOLQi4f5lo6yJ4mMJ11TutXRnvVt0z2qiHsZsj2X aytzbl/jPX9qP1uBFmc515ryZXZ5aoiuZUApZ10ctupmqGbDqEIe8yVQLNVmJoWMbb8J lmwje03jVATlHrpm5pUUl5pO2OnmU9k7fWzDC7RPDyaKzMjg6H6LUl0SzoL7f+DeIVKQ 3ImQ== X-Gm-Message-State: AOAM531Ek+pD4F3PjOyG3Tv0akBYp2l/ldl6qBuraJ+1TUJ5d4+1XehL hmHMJwusH1/phBvNEzM5weZqyMFHJut+ X-Google-Smtp-Source: ABdhPJzcVkgi3JMg9FekP/jbB0HbpE3UWWhA+GGvHutupthetmKemwuv8RCt3TmZZl6hj5AyHVL2xNuWi5Ck X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a25:7694:0:b0:624:a2d9:c8f0 with SMTP id r142-20020a257694000000b00624a2d9c8f0mr10070639ybc.523.1645593879400; Tue, 22 Feb 2022 21:24:39 -0800 (PST) Date: Tue, 22 Feb 2022 21:22:00 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-25-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 24/47] mm: asi: Support for local non-sensitive slab caches From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org A new flag SLAB_LOCAL_NONSENSITIVE is added to designate that a slab cache can be used for local non-sensitive allocations. For such caches, a per-process child cache will be created when a process tries to make an allocation from that cache for the first time, similar to the per-memcg child caches that used to exist before the object based memcg charging mechanism. (A lot of the infrastructure for handling these child caches is derived from the original per-memcg cache code). If a cache only has SLAB_LOCAL_NONSENSITIVE, then all allocations from that cache will automatically be considered locally non-sensitive. But if a cache has both SLAB_LOCAL_NONSENSITIVE and SLAB_GLOBAL_NONSENSITIVE, then each allocation must specify one of __GFP_LOCAL_NONSENSITIVE or __GFP_GLOBAL_NONSENSITIVE. Note that the first locally non-sensitive allocation that a process makes from a given slab cache must occur from a sleepable context. If that is not the case, then a new kmem_cache_precreate_local* API must be called from a sleepable context before the first allocation. 
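As an illustrative sketch (the cache name and object size are invented, but the flags and helpers are the ones added by this patch), a typical user would do something like the following; the precreate call is only needed when the first locally non-sensitive allocation might happen in atomic context:

static struct kmem_cache *example_cache;

/* Cache creation, e.g. at subsystem init. */
static int example_cache_init(void)
{
	example_cache = kmem_cache_create("example_objs", 256, 0,
					  SLAB_LOCAL_NONSENSITIVE |
					  SLAB_GLOBAL_NONSENSITIVE, NULL);
	return example_cache ? 0 : -ENOMEM;
}

/*
 * Called by the process (e.g. while setting up a VM) from a sleepable
 * context, so that its per-process child cache exists before any
 * atomic-context allocation is attempted.
 */
static int example_prepare_process(void)
{
	return kmem_cache_precreate_local(example_cache);
}

static void *example_alloc(gfp_t gfp)
{
	/*
	 * Because the cache has both SLAB_LOCAL_NONSENSITIVE and
	 * SLAB_GLOBAL_NONSENSITIVE, every allocation must pick one of
	 * __GFP_LOCAL_NONSENSITIVE or __GFP_GLOBAL_NONSENSITIVE.
	 */
	return kmem_cache_alloc(example_cache, gfp | __GFP_LOCAL_NONSENSITIVE);
}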
Signed-off-by: Junaid Shahid --- arch/x86/mm/asi.c | 5 + include/linux/mm_types.h | 4 + include/linux/sched/mm.h | 12 ++ include/linux/slab.h | 38 +++- include/linux/slab_def.h | 4 + kernel/fork.c | 3 +- mm/slab.c | 41 ++++- mm/slab.h | 151 +++++++++++++++- mm/slab_common.c | 363 ++++++++++++++++++++++++++++++++++++++- 9 files changed, 602 insertions(+), 19 deletions(-) diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index a3d96be76fa9..6b9a0f5ab391 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -4,6 +4,7 @@ #include #include #include +#include #include #include @@ -455,6 +456,8 @@ int asi_init_mm_state(struct mm_struct *mm) memset(mm->asi, 0, sizeof(mm->asi)); mm->asi_enabled = false; + RCU_INIT_POINTER(mm->local_slab_caches, NULL); + mm->local_slab_caches_array_size = 0; /* * TODO: In addition to a cgroup flag, we may also want a per-process @@ -482,6 +485,8 @@ void asi_free_mm_state(struct mm_struct *mm) if (!boot_cpu_has(X86_FEATURE_ASI) || !mm->asi_enabled) return; + free_local_slab_caches(mm); + asi_free_pgd_range(&mm->asi[0], pgd_index(ASI_LOCAL_MAP), pgd_index(ASI_LOCAL_MAP + PFN_PHYS(max_possible_pfn)) + 1); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index e6980ae31323..56511adc263e 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -517,6 +517,10 @@ struct mm_struct { struct asi asi[ASI_MAX_NUM]; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + struct kmem_cache * __rcu *local_slab_caches; + uint local_slab_caches_array_size; +#endif /** * @mm_users: The number of users including userspace. * diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index aca874d33fe6..c9122d4436d4 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -37,9 +37,21 @@ static inline void mmgrab(struct mm_struct *mm) } extern void __mmdrop(struct mm_struct *mm); +extern void mmdrop_async(struct mm_struct *mm); static inline void mmdrop(struct mm_struct *mm) { +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + /* + * We really only need to do this if we are in an atomic context. + * Unfortunately, there doesn't seem to be a reliable way to detect + * atomic context across all kernel configs. So we just always do async. 
+ */ + if (rcu_access_pointer(mm->local_slab_caches)) { + mmdrop_async(mm); + return; + } +#endif /* * The implicit full barrier implied by atomic_dec_and_test() is * required by the membarrier system call before returning to diff --git a/include/linux/slab.h b/include/linux/slab.h index 7b8a3853d827..ef9c73c0d874 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -93,6 +93,8 @@ /* Avoid kmemleak tracing */ #define SLAB_NOLEAKTRACE ((slab_flags_t __force)0x00800000U) +/* 0x01000000U is used below for SLAB_LOCAL_NONSENSITIVE */ + /* Fault injection mark */ #ifdef CONFIG_FAILSLAB # define SLAB_FAILSLAB ((slab_flags_t __force)0x02000000U) @@ -121,8 +123,10 @@ #define SLAB_DEACTIVATED ((slab_flags_t __force)0x10000000U) #ifdef CONFIG_ADDRESS_SPACE_ISOLATION +#define SLAB_LOCAL_NONSENSITIVE ((slab_flags_t __force)0x01000000U) #define SLAB_GLOBAL_NONSENSITIVE ((slab_flags_t __force)0x20000000U) #else +#define SLAB_LOCAL_NONSENSITIVE 0 #define SLAB_GLOBAL_NONSENSITIVE 0 #endif @@ -377,7 +381,8 @@ static __always_inline struct kmem_cache *get_kmalloc_cache(gfp_t flags, { #ifdef CONFIG_ADDRESS_SPACE_ISOLATION - if (static_asi_enabled() && (flags & __GFP_GLOBAL_NONSENSITIVE)) + if (static_asi_enabled() && + (flags & (__GFP_GLOBAL_NONSENSITIVE | __GFP_LOCAL_NONSENSITIVE))) return nonsensitive_kmalloc_caches[kmalloc_type(flags)][index]; #endif return kmalloc_caches[kmalloc_type(flags)][index]; @@ -800,4 +805,35 @@ int slab_dead_cpu(unsigned int cpu); #define slab_dead_cpu NULL #endif +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + +struct kmem_cache *get_local_kmem_cache(struct kmem_cache *s, + struct mm_struct *mm, gfp_t flags); +void free_local_slab_caches(struct mm_struct *mm); +int kmem_cache_precreate_local(struct kmem_cache *s); +int kmem_cache_precreate_local_kmalloc(size_t size, gfp_t flags); + +#else + +static inline +struct kmem_cache *get_local_kmem_cache(struct kmem_cache *s, + struct mm_struct *mm, gfp_t flags) +{ + return NULL; +} + +static inline void free_local_slab_caches(struct mm_struct *mm) { } + +static inline int kmem_cache_precreate_local(struct kmem_cache *s) +{ + return 0; +} + +static inline int kmem_cache_precreate_local_kmalloc(size_t size, gfp_t flags) +{ + return 0; +} + +#endif + #endif /* _LINUX_SLAB_H */ diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h index 3aa5e1e73ab6..53cbc1f40031 100644 --- a/include/linux/slab_def.h +++ b/include/linux/slab_def.h @@ -81,6 +81,10 @@ struct kmem_cache { unsigned int *random_seq; #endif +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + struct kmem_local_cache_info local_cache_info; +#endif + unsigned int useroffset; /* Usercopy region offset */ unsigned int usersize; /* Usercopy region size */ diff --git a/kernel/fork.c b/kernel/fork.c index 68b3aeab55ac..d7f55de00947 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -714,13 +714,14 @@ static void mmdrop_async_fn(struct work_struct *work) __mmdrop(mm); } -static void mmdrop_async(struct mm_struct *mm) +void mmdrop_async(struct mm_struct *mm) { if (unlikely(atomic_dec_and_test(&mm->mm_count))) { INIT_WORK(&mm->async_put_work, mmdrop_async_fn); schedule_work(&mm->async_put_work); } } +EXPORT_SYMBOL(mmdrop_async); static inline void free_signal_struct(struct signal_struct *sig) { diff --git a/mm/slab.c b/mm/slab.c index 5a928d95d67b..44cf6d127a4c 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -1403,6 +1403,8 @@ static void kmem_freepages(struct kmem_cache *cachep, struct page *page) /* In union with page->mapping where page allocator expects NULL */ page->slab_cache = NULL; + 
restore_page_nonsensitive_metadata(page, cachep); + if (current->reclaim_state) current->reclaim_state->reclaimed_slab += 1 << order; unaccount_slab_page(page, order, cachep); @@ -2061,11 +2063,9 @@ int __kmem_cache_create(struct kmem_cache *cachep, slab_flags_t flags) cachep->allocflags |= GFP_DMA32; if (flags & SLAB_RECLAIM_ACCOUNT) cachep->allocflags |= __GFP_RECLAIMABLE; - if (flags & SLAB_GLOBAL_NONSENSITIVE) - cachep->allocflags |= __GFP_GLOBAL_NONSENSITIVE; cachep->size = size; cachep->reciprocal_buffer_size = reciprocal_value(size); - + set_nonsensitive_cache_params(cachep); #if DEBUG /* * If we're going to use the generic kernel_map_pages() @@ -3846,8 +3846,8 @@ static int setup_kmem_cache_nodes(struct kmem_cache *cachep, gfp_t gfp) } /* Always called with the slab_mutex held */ -static int do_tune_cpucache(struct kmem_cache *cachep, int limit, - int batchcount, int shared, gfp_t gfp) +static int __do_tune_cpucache(struct kmem_cache *cachep, int limit, + int batchcount, int shared, gfp_t gfp) { struct array_cache __percpu *cpu_cache, *prev; int cpu; @@ -3892,6 +3892,29 @@ static int do_tune_cpucache(struct kmem_cache *cachep, int limit, return setup_kmem_cache_nodes(cachep, gfp); } +static int do_tune_cpucache(struct kmem_cache *cachep, int limit, + int batchcount, int shared, gfp_t gfp) +{ + int ret; + struct kmem_cache *c; + + ret = __do_tune_cpucache(cachep, limit, batchcount, shared, gfp); + + if (slab_state < FULL) + return ret; + + if ((ret < 0) || !is_root_cache(cachep)) + return ret; + + lockdep_assert_held(&slab_mutex); + for_each_child_cache(c, cachep) { + /* return value determined by the root cache only */ + __do_tune_cpucache(c, limit, batchcount, shared, gfp); + } + + return ret; +} + /* Called with slab_mutex held always */ static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp) { @@ -3904,6 +3927,14 @@ static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp) if (err) goto end; + if (!is_root_cache(cachep)) { + struct kmem_cache *root = get_root_cache(cachep); + + limit = root->limit; + shared = root->shared; + batchcount = root->batchcount; + } + /* * The head array serves three purposes: * - create a LIFO ordering, i.e. return objects that are cache-warm diff --git a/mm/slab.h b/mm/slab.h index f190f4fc0286..b9e11038be27 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -5,6 +5,45 @@ * Internal slab definitions */ +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + +struct kmem_local_cache_info { + /* Valid for child caches. NULL for the root cache itself. */ + struct kmem_cache *root_cache; + union { + /* For root caches */ + struct { + int cache_id; + struct list_head __root_caches_node; + struct list_head children; + /* + * For SLAB_LOCAL_NONSENSITIVE root caches, this points + * to the cache to be used for local non-sensitive + * allocations from processes without ASI enabled. + * + * For root caches with only SLAB_LOCAL_NONSENSITIVE, + * the root cache itself is used as the sensitive cache. + * + * For root caches with both SLAB_LOCAL_NONSENSITIVE and + * SLAB_GLOBAL_NONSENSITIVE, the sensitive cache will be + * a child cache allocated on-demand. + * + * For non-sensiitve kmalloc caches, the sensitive cache + * will just be the corresponding regular kmalloc cache. 
+ */ + struct kmem_cache *sensitive_cache; + }; + + /* For child (process-local) caches */ + struct { + struct mm_struct *mm; + struct list_head children_node; + }; + }; +}; + +#endif + #ifdef CONFIG_SLOB /* * Common fields provided in kmem_cache by all slab allocators @@ -128,8 +167,7 @@ static inline slab_flags_t kmem_cache_flags(unsigned int object_size, } #endif -/* This will also include SLAB_LOCAL_NONSENSITIVE in a later patch. */ -#define SLAB_NONSENSITIVE SLAB_GLOBAL_NONSENSITIVE +#define SLAB_NONSENSITIVE (SLAB_GLOBAL_NONSENSITIVE | SLAB_LOCAL_NONSENSITIVE) /* Legal flag mask for kmem_cache_create(), for various configurations */ #define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \ @@ -251,6 +289,99 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla return false; } +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + +/* List of all root caches. */ +extern struct list_head slab_root_caches; +#define root_caches_node local_cache_info.__root_caches_node + +/* + * Iterate over all child caches of the given root cache. The caller must hold + * slab_mutex. + */ +#define for_each_child_cache(iter, root) \ + list_for_each_entry(iter, &(root)->local_cache_info.children, \ + local_cache_info.children_node) + +static inline bool is_root_cache(struct kmem_cache *s) +{ + return !s->local_cache_info.root_cache; +} + +static inline bool slab_equal_or_root(struct kmem_cache *s, + struct kmem_cache *p) +{ + return p == s || p == s->local_cache_info.root_cache; +} + +/* + * We use suffixes to the name in child caches because we can't have caches + * created in the system with the same name. But when we print them + * locally, better refer to them with the base name + */ +static inline const char *cache_name(struct kmem_cache *s) +{ + if (!is_root_cache(s)) + s = s->local_cache_info.root_cache; + return s->name; +} + +static inline struct kmem_cache *get_root_cache(struct kmem_cache *s) +{ + if (is_root_cache(s)) + return s; + return s->local_cache_info.root_cache; +} + +static inline +void restore_page_nonsensitive_metadata(struct page *page, + struct kmem_cache *cachep) +{ + if (PageLocalNonSensitive(page)) { + VM_BUG_ON(is_root_cache(cachep)); + page->asi_mm = cachep->local_cache_info.mm; + } +} + +void set_nonsensitive_cache_params(struct kmem_cache *s); + +#else /* CONFIG_ADDRESS_SPACE_ISOLATION */ + +#define slab_root_caches slab_caches +#define root_caches_node list + +#define for_each_child_cache(iter, root) \ + for ((void)(iter), (void)(root); 0; ) + +static inline bool is_root_cache(struct kmem_cache *s) +{ + return true; +} + +static inline bool slab_equal_or_root(struct kmem_cache *s, + struct kmem_cache *p) +{ + return s == p; +} + +static inline const char *cache_name(struct kmem_cache *s) +{ + return s->name; +} + +static inline struct kmem_cache *get_root_cache(struct kmem_cache *s) +{ + return s; +} + +static inline void restore_page_nonsensitive_metadata(struct page *page, + struct kmem_cache *cachep) +{ } + +static inline void set_nonsensitive_cache_params(struct kmem_cache *s) { } + +#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */ + #ifdef CONFIG_MEMCG_KMEM int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s, gfp_t gfp, bool new_page); @@ -449,11 +580,12 @@ static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x) struct kmem_cache *cachep; if (!IS_ENABLED(CONFIG_SLAB_FREELIST_HARDENED) && + !(s->flags & SLAB_LOCAL_NONSENSITIVE) && !kmem_cache_debug_flags(s, SLAB_CONSISTENCY_CHECKS)) return s; cachep = 
virt_to_cache(x); - if (WARN(cachep && cachep != s, + if (WARN(cachep && !slab_equal_or_root(cachep, s), "%s: Wrong slab cache. %s but object is from %s\n", __func__, s->name, cachep->name)) print_tracking(cachep, x); @@ -501,11 +633,24 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, if (static_asi_enabled()) { VM_BUG_ON(!(s->flags & SLAB_GLOBAL_NONSENSITIVE) && (flags & __GFP_GLOBAL_NONSENSITIVE)); + VM_BUG_ON(!(s->flags & SLAB_LOCAL_NONSENSITIVE) && + (flags & __GFP_LOCAL_NONSENSITIVE)); + VM_BUG_ON((s->flags & SLAB_NONSENSITIVE) == SLAB_NONSENSITIVE && + !(flags & (__GFP_LOCAL_NONSENSITIVE | + __GFP_GLOBAL_NONSENSITIVE))); } if (should_failslab(s, flags)) return NULL; + if (static_asi_enabled() && + (!(flags & __GFP_GLOBAL_NONSENSITIVE) && + (s->flags & SLAB_LOCAL_NONSENSITIVE))) { + s = get_local_kmem_cache(s, current->mm, flags); + if (!s) + return NULL; + } + if (!memcg_slab_pre_alloc_hook(s, objcgp, size, flags)) return NULL; diff --git a/mm/slab_common.c b/mm/slab_common.c index 72dee2494bf8..b486b72d6344 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -42,6 +42,13 @@ static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work); static DECLARE_WORK(slab_caches_to_rcu_destroy_work, slab_caches_to_rcu_destroy_workfn); +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + +static DEFINE_IDA(nonsensitive_cache_ids); +static uint max_num_local_slab_caches = 32; + +#endif + /* * Set of flags that will prevent slab merging */ @@ -131,6 +138,69 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr, return i; } +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + +LIST_HEAD(slab_root_caches); + +static void init_local_cache_info(struct kmem_cache *s, struct kmem_cache *root) +{ + if (root) { + s->local_cache_info.root_cache = root; + list_add(&s->local_cache_info.children_node, + &root->local_cache_info.children); + } else { + s->local_cache_info.cache_id = -1; + INIT_LIST_HEAD(&s->local_cache_info.children); + list_add(&s->root_caches_node, &slab_root_caches); + } +} + +static void cleanup_local_cache_info(struct kmem_cache *s) +{ + if (is_root_cache(s)) { + VM_BUG_ON(!list_empty(&s->local_cache_info.children)); + + list_del(&s->root_caches_node); + if (s->local_cache_info.cache_id >= 0) + ida_free(&nonsensitive_cache_ids, + s->local_cache_info.cache_id); + } else { + struct mm_struct *mm = s->local_cache_info.mm; + struct kmem_cache *root_cache = s->local_cache_info.root_cache; + int id = root_cache->local_cache_info.cache_id; + + list_del(&s->local_cache_info.children_node); + if (mm) { + struct kmem_cache **local_caches = + rcu_dereference_protected(mm->local_slab_caches, + lockdep_is_held(&slab_mutex)); + local_caches[id] = NULL; + } + } +} + +void set_nonsensitive_cache_params(struct kmem_cache *s) +{ + if (s->flags & SLAB_GLOBAL_NONSENSITIVE) { + s->allocflags |= __GFP_GLOBAL_NONSENSITIVE; + VM_BUG_ON(!is_root_cache(s)); + } else if (s->flags & SLAB_LOCAL_NONSENSITIVE) { + if (is_root_cache(s)) + s->local_cache_info.sensitive_cache = s; + else + s->allocflags |= __GFP_LOCAL_NONSENSITIVE; + } +} + +#else + +static inline +void init_local_cache_info(struct kmem_cache *s, struct kmem_cache *root) { } + +static inline void cleanup_local_cache_info(struct kmem_cache *s) { } + +#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */ + /* * Figure out what the alignment of the objects will be given a set of * flags, a user specified alignment and the size of the objects. 
@@ -168,6 +238,9 @@ int slab_unmergeable(struct kmem_cache *s) if (slab_nomerge || (s->flags & SLAB_NEVER_MERGE)) return 1; + if (!is_root_cache(s)) + return 1; + if (s->ctor) return 1; @@ -202,7 +275,7 @@ struct kmem_cache *find_mergeable(unsigned int size, unsigned int align, if (flags & SLAB_NEVER_MERGE) return NULL; - list_for_each_entry_reverse(s, &slab_caches, list) { + list_for_each_entry_reverse(s, &slab_root_caches, root_caches_node) { if (slab_unmergeable(s)) continue; @@ -254,6 +327,8 @@ static struct kmem_cache *create_cache(const char *name, s->useroffset = useroffset; s->usersize = usersize; + init_local_cache_info(s, root_cache); + err = __kmem_cache_create(s, flags); if (err) goto out_free_cache; @@ -266,6 +341,7 @@ static struct kmem_cache *create_cache(const char *name, return s; out_free_cache: + cleanup_local_cache_info(s); kmem_cache_free(kmem_cache, s); goto out; } @@ -459,6 +535,7 @@ static int shutdown_cache(struct kmem_cache *s) return -EBUSY; list_del(&s->list); + cleanup_local_cache_info(s); if (s->flags & SLAB_TYPESAFE_BY_RCU) { #ifdef SLAB_SUPPORTS_SYSFS @@ -480,6 +557,36 @@ static int shutdown_cache(struct kmem_cache *s) return 0; } +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + +static int shutdown_child_caches(struct kmem_cache *s) +{ + struct kmem_cache *c, *c2; + int r; + + VM_BUG_ON(!is_root_cache(s)); + + lockdep_assert_held(&slab_mutex); + + list_for_each_entry_safe(c, c2, &s->local_cache_info.children, + local_cache_info.children_node) { + r = shutdown_cache(c); + if (r) + return r; + } + + return 0; +} + +#else + +static inline int shutdown_child_caches(struct kmem_cache *s) +{ + return 0; +} + +#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */ + void slab_kmem_cache_release(struct kmem_cache *s) { __kmem_cache_release(s); @@ -501,7 +608,10 @@ void kmem_cache_destroy(struct kmem_cache *s) if (s->refcount) goto out_unlock; - err = shutdown_cache(s); + err = shutdown_child_caches(s); + if (!err) + err = shutdown_cache(s); + if (err) { pr_err("%s %s: Slab cache still has objects\n", __func__, s->name); @@ -651,6 +761,8 @@ void __init create_boot_cache(struct kmem_cache *s, const char *name, s->useroffset = useroffset; s->usersize = usersize; + init_local_cache_info(s, NULL); + err = __kmem_cache_create(s, flags); if (err) @@ -897,6 +1009,13 @@ new_kmalloc_cache(int idx, enum kmalloc_cache_type type, slab_flags_t flags) */ if (IS_ENABLED(CONFIG_MEMCG_KMEM) && (type == KMALLOC_NORMAL)) caches[type][idx]->refcount = -1; + +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + + if (flags & SLAB_NONSENSITIVE) + caches[type][idx]->local_cache_info.sensitive_cache = + kmalloc_caches[type][idx]; +#endif } /* @@ -1086,12 +1205,12 @@ static void print_slabinfo_header(struct seq_file *m) void *slab_start(struct seq_file *m, loff_t *pos) { mutex_lock(&slab_mutex); - return seq_list_start(&slab_caches, *pos); + return seq_list_start(&slab_root_caches, *pos); } void *slab_next(struct seq_file *m, void *p, loff_t *pos) { - return seq_list_next(p, &slab_caches, pos); + return seq_list_next(p, &slab_root_caches, pos); } void slab_stop(struct seq_file *m, void *p) @@ -1099,6 +1218,24 @@ void slab_stop(struct seq_file *m, void *p) mutex_unlock(&slab_mutex); } +static void +accumulate_children_slabinfo(struct kmem_cache *s, struct slabinfo *info) +{ + struct kmem_cache *c; + struct slabinfo sinfo; + + for_each_child_cache(c, s) { + memset(&sinfo, 0, sizeof(sinfo)); + get_slabinfo(c, &sinfo); + + info->active_slabs += sinfo.active_slabs; + info->num_slabs += sinfo.num_slabs; + info->shared_avail += 
sinfo.shared_avail; + info->active_objs += sinfo.active_objs; + info->num_objs += sinfo.num_objs; + } +} + static void cache_show(struct kmem_cache *s, struct seq_file *m) { struct slabinfo sinfo; @@ -1106,8 +1243,10 @@ static void cache_show(struct kmem_cache *s, struct seq_file *m) memset(&sinfo, 0, sizeof(sinfo)); get_slabinfo(s, &sinfo); + accumulate_children_slabinfo(s, &sinfo); + seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", - s->name, sinfo.active_objs, sinfo.num_objs, s->size, + cache_name(s), sinfo.active_objs, sinfo.num_objs, s->size, sinfo.objects_per_slab, (1 << sinfo.cache_order)); seq_printf(m, " : tunables %4u %4u %4u", @@ -1120,9 +1259,9 @@ static void cache_show(struct kmem_cache *s, struct seq_file *m) static int slab_show(struct seq_file *m, void *p) { - struct kmem_cache *s = list_entry(p, struct kmem_cache, list); + struct kmem_cache *s = list_entry(p, struct kmem_cache, root_caches_node); - if (p == slab_caches.next) + if (p == slab_root_caches.next) print_slabinfo_header(m); cache_show(s, m); return 0; @@ -1148,14 +1287,14 @@ void dump_unreclaimable_slab(void) pr_info("Unreclaimable slab info:\n"); pr_info("Name Used Total\n"); - list_for_each_entry(s, &slab_caches, list) { + list_for_each_entry(s, &slab_root_caches, root_caches_node) { if (s->flags & SLAB_RECLAIM_ACCOUNT) continue; get_slabinfo(s, &sinfo); if (sinfo.num_objs > 0) - pr_info("%-17s %10luKB %10luKB\n", s->name, + pr_info("%-17s %10luKB %10luKB\n", cache_name(s), (sinfo.active_objs * s->size) / 1024, (sinfo.num_objs * s->size) / 1024); } @@ -1361,3 +1500,209 @@ int should_failslab(struct kmem_cache *s, gfp_t gfpflags) return 0; } ALLOW_ERROR_INJECTION(should_failslab, ERRNO); + +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + +static int resize_local_slab_caches_array(struct mm_struct *mm, gfp_t flags) +{ + struct kmem_cache **new_array; + struct kmem_cache **old_array = + rcu_dereference_protected(mm->local_slab_caches, + lockdep_is_held(&slab_mutex)); + + new_array = kcalloc(max_num_local_slab_caches, + sizeof(struct kmem_cache *), flags); + if (!new_array) + return -ENOMEM; + + if (old_array) + memcpy(new_array, old_array, mm->local_slab_caches_array_size * + sizeof(struct kmem_cache *)); + + rcu_assign_pointer(mm->local_slab_caches, new_array); + smp_store_release(&mm->local_slab_caches_array_size, + max_num_local_slab_caches); + + if (old_array) { + synchronize_rcu(); + kfree(old_array); + } + + return 0; +} + +static int get_or_alloc_cache_id(struct kmem_cache *root_cache, gfp_t flags) +{ + int id = root_cache->local_cache_info.cache_id; + + if (id >= 0) + return id; + + id = ida_alloc_max(&nonsensitive_cache_ids, + max_num_local_slab_caches - 1, flags); + if (id == -ENOSPC) { + max_num_local_slab_caches *= 2; + id = ida_alloc_max(&nonsensitive_cache_ids, + max_num_local_slab_caches - 1, flags); + } + + if (id >= 0) + root_cache->local_cache_info.cache_id = id; + + return id; +} + +static struct kmem_cache *create_local_kmem_cache(struct kmem_cache *root_cache, + struct mm_struct *mm, + gfp_t flags) +{ + char *name; + struct kmem_cache *s = NULL; + slab_flags_t slab_flags = root_cache->flags & CACHE_CREATE_MASK; + struct kmem_cache **cache_ptr; + + flags &= GFP_RECLAIM_MASK; + + mutex_lock(&slab_mutex); + + if (mm_asi_enabled(mm)) { + struct kmem_cache **caches; + int id = get_or_alloc_cache_id(root_cache, flags); + + if (id < 0) + goto out; + + flags |= __GFP_ACCOUNT; + + if (mm->local_slab_caches_array_size <= id && + resize_local_slab_caches_array(mm, flags) < 0) + goto out; + + caches = 
rcu_dereference_protected(mm->local_slab_caches, + lockdep_is_held(&slab_mutex)); + cache_ptr = &caches[id]; + if (*cache_ptr) { + s = *cache_ptr; + goto out; + } + + slab_flags &= ~SLAB_GLOBAL_NONSENSITIVE; + name = kasprintf(flags, "%s(%d:%s)", root_cache->name, + task_pid_nr(mm->owner), mm->owner->comm); + if (!name) + goto out; + + } else { + cache_ptr = &root_cache->local_cache_info.sensitive_cache; + if (*cache_ptr) { + s = *cache_ptr; + goto out; + } + + slab_flags &= ~SLAB_NONSENSITIVE; + name = kasprintf(flags, "%s(sensitive)", root_cache->name); + if (!name) + goto out; + } + + s = create_cache(name, + root_cache->object_size, + root_cache->align, + slab_flags, + root_cache->useroffset, root_cache->usersize, + root_cache->ctor, root_cache); + if (IS_ERR(s)) { + pr_info("Unable to create child kmem cache %s. Err %ld", + name, PTR_ERR(s)); + kfree(name); + s = NULL; + goto out; + } + + if (mm_asi_enabled(mm)) + s->local_cache_info.mm = mm; + + smp_store_release(cache_ptr, s); +out: + mutex_unlock(&slab_mutex); + + return s; +} + +struct kmem_cache *get_local_kmem_cache(struct kmem_cache *s, + struct mm_struct *mm, gfp_t flags) +{ + struct kmem_cache *local_cache = NULL; + + if (!(s->flags & SLAB_LOCAL_NONSENSITIVE) || !is_root_cache(s)) + return s; + + if (mm_asi_enabled(mm)) { + struct kmem_cache **caches; + int id = READ_ONCE(s->local_cache_info.cache_id); + uint array_size = smp_load_acquire( + &mm->local_slab_caches_array_size); + + if (id >= 0 && array_size > id) { + rcu_read_lock(); + caches = rcu_dereference(mm->local_slab_caches); + local_cache = smp_load_acquire(&caches[id]); + rcu_read_unlock(); + } + } else { + local_cache = + smp_load_acquire(&s->local_cache_info.sensitive_cache); + } + + if (!local_cache) + local_cache = create_local_kmem_cache(s, mm, flags); + + return local_cache; +} + +void free_local_slab_caches(struct mm_struct *mm) +{ + uint i; + struct kmem_cache **caches = + rcu_dereference_protected(mm->local_slab_caches, + atomic_read(&mm->mm_count) == 0); + + if (!caches) + return; + + cpus_read_lock(); + mutex_lock(&slab_mutex); + + for (i = 0; i < mm->local_slab_caches_array_size; i++) + if (caches[i]) + WARN_ON(shutdown_cache(caches[i])); + + mutex_unlock(&slab_mutex); + cpus_read_unlock(); + + kfree(caches); +} + +int kmem_cache_precreate_local(struct kmem_cache *s) +{ + VM_BUG_ON(!is_root_cache(s)); + VM_BUG_ON(!in_task()); + might_sleep(); + + return get_local_kmem_cache(s, current->mm, GFP_KERNEL) ? 
0 : -ENOMEM; +} +EXPORT_SYMBOL(kmem_cache_precreate_local); + +int kmem_cache_precreate_local_kmalloc(size_t size, gfp_t flags) +{ + struct kmem_cache *s = kmalloc_slab(size, + flags | __GFP_LOCAL_NONSENSITIVE); + + if (ZERO_OR_NULL_PTR(s)) + return 0; + + return kmem_cache_precreate_local(s); +} +EXPORT_SYMBOL(kmem_cache_precreate_local_kmalloc); + +#endif From patchwork Wed Feb 23 05:22:01 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756430 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id F3D9EC433EF for ; Wed, 23 Feb 2022 05:26:27 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238283AbiBWF0w (ORCPT ); Wed, 23 Feb 2022 00:26:52 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58122 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238220AbiBWFZx (ORCPT ); Wed, 23 Feb 2022 00:25:53 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4D5286E2A8 for ; Tue, 22 Feb 2022 21:24:53 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id k10-20020a056902070a00b0062469b00335so10829507ybt.14 for ; Tue, 22 Feb 2022 21:24:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=qiOY3I7MTzDOyydGuOsoL3kibVHTWqt/T8NonRDSdjI=; b=gHD/ZzZ2afDsPzQ43uzTfHDeNudIA9BOq9xrr9L14+jP+dH61iVLZ9hzKNk6oCuBOI frJ4Y/pMW9lHMFuHIDT5LntquphVA/hdUftBkeKzRimB/OZsUvqrqt+OTldNL/5M2dvF +PIjZDytqBI9qNUICzwPtZ9WZreG3la5anScSUKXP0fumh99WG+FQaCeyfbwMohmDN9Z TqQY1Ne53dTgZlticBLJ/nLFoMZJ6npZslQ1FhKLNnIG+gBpH+SHHrgjBmREZgI8pcKn GtFQ6xyENvmtRWa1WOq2159X/570g6HXWtqzHRNADNoxBQZmRvsc5EkR5xI3zImo/8Xk CC5Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=qiOY3I7MTzDOyydGuOsoL3kibVHTWqt/T8NonRDSdjI=; b=5K7kV1L/1norlOP74jTBaUkGBE1DEVX1o5NYI354uGjhzxMvFFRtSpEcCVmey/+nXf /IhTt2DMmOJjqh1uGTM/clhkb+yYNT11Utyr5UcCFplo5lBPaCKrlujFVJ7O+p7R/Ia/ JLOxRgMzwGx/SojnEjHpvQ7GnvsxUD2OWZ40l8HIdsnVhrLp/uKdouJGQEAlGgwHGdSB QBev9P4UtrAE5oyncPPWBRrJ8yvgn6mxDrQ2+xwpqTvNBl7rquGFyNuacvMXx3dMxxoR PpO1ROiQfEsWnvOmLHMvsRSjBAFrbY08VTZ9fxh/qLWOySS6I7Kib1/SloW6DXTZaQkO sEtg== X-Gm-Message-State: AOAM533Qnwaf3NBH+Iyn11sGaFhUgmH5m0owOZD3JeH3X6fOxHrJlpYA aWg4WsFoRbyMKzVEtHhX26EVjgRp6Gm2 X-Google-Smtp-Source: ABdhPJz0RUH+JeSNvxcFS2t+PaXJsk2Yx25dJd/bxX+okavoC9VlrjteMlukvkj4zkv9eNKI+iSQhOk6jOMd X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a25:bad2:0:b0:620:fe28:ff53 with SMTP id a18-20020a25bad2000000b00620fe28ff53mr26732357ybk.340.1645593881600; Tue, 22 Feb 2022 21:24:41 -0800 (PST) Date: Tue, 22 Feb 2022 21:22:01 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-26-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 25/47] mm: asi: Avoid warning from NMI userspace accesses in ASI context From: 
Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org nmi_uaccess_okay() emits a warning if current CR3 != mm->pgd. Limit the warning to only when ASI is not active. Signed-off-by: Junaid Shahid --- arch/x86/mm/tlb.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 25bee959d1d3..628f1cd904ac 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -1292,7 +1292,8 @@ bool nmi_uaccess_okay(void) if (loaded_mm != current_mm) return false; - VM_WARN_ON_ONCE(current_mm->pgd != __va(read_cr3_pa())); + VM_WARN_ON_ONCE(current_mm->pgd != __va(read_cr3_pa()) && + !is_asi_active()); return true; } From patchwork Wed Feb 23 05:22:02 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756432 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 11082C433FE for ; Wed, 23 Feb 2022 05:26:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238187AbiBWF05 (ORCPT ); Wed, 23 Feb 2022 00:26:57 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58186 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238227AbiBWF0E (ORCPT ); Wed, 23 Feb 2022 00:26:04 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BDC6B6E2B3 for ; Tue, 22 Feb 2022 21:24:53 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id k7-20020a255607000000b00621afc793b8so26767959ybb.1 for ; Tue, 22 Feb 2022 21:24:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=mHknd275MnGb/PaPBY3xJHVlcaSr+gb2g5yi23zkCMQ=; b=L60sUeBO3icqMJb7NMSmlVXjmIABeVNLCCtZzlOUINB+XKCgvwoL14gcA1/NI1wk1X /ltNckj3/3p7zfm1t1rOui84yRyyoV+XIZUZcTPfV3dYES6ZKCv3iaEJAriiM57Y+pJ2 8u/ZPuM69uvOpnXWpYYvEAAwrxzB6ihF2OY0/HqTUANIyhOPp6jmVN945B9gOsa+htR7 ghj+eGPzbx50ZpS/8LrSXhCGcEICsrhvK12TqRAuRKoUTOubcXhnkqoaG145HMstky5e ZFHHetF4BUFaO5ULQYQ3U7cD5n4sgmCCalGk7gwPkRzwbBni9d6kLYQ28QmTUA8pH14D Kjrg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=mHknd275MnGb/PaPBY3xJHVlcaSr+gb2g5yi23zkCMQ=; b=R85UP7e8f88+daYHd4S4ia67i8Jml/wcbTGXLlHozVExqzJwLo8nqZgaHNGQQoSHGo IWzSSnImlK98r+kYVv0JGHZ/Zm75eArY9kY/ch2y/Q3WfaVNbDflm8MBD8r+0o2+V8JQ u2vkyEKF2xijpIYf+a90XFQaGFTOu8zL3oaP7uiDk7jw4awgj2tWf5TXem8VlANAFpe4 fV2hwoVfMidONLwiTa5fLXy1DxdLHJRDNxoI/+izhnyD/1FC9ey9gHM+3OoNiynSe4XM zKx8MC2UT9Vxtk8brrQEObInQLEaxqdrGimUU2HbzbBQGadwM9c0k6dOGfQ9RnRVqNkR OAlA== X-Gm-Message-State: AOAM530/zoRL9Mehb8B9Di79NPOTqyajAt2mwrD6WvHZFtWhEtuCEwNh zjvVLE/Zcjtug1kbQlm2TVW4yzhYolUS X-Google-Smtp-Source: ABdhPJw5GHEn07Q8rEslG5JDYumsMKlC/t9Gjkx79Rz/pi8IkAfQS2IAOFNhhCDkd7GvnzpI+G+qB+zNYrIt X-Received: from js-desktop.svl.corp.google.com 
([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a81:8414:0:b0:2d0:fdd8:f7e2 with SMTP id u20-20020a818414000000b002d0fdd8f7e2mr27082928ywf.156.1645593883815; Tue, 22 Feb 2022 21:24:43 -0800 (PST) Date: Tue, 22 Feb 2022 21:22:02 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-27-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 26/47] mm: asi: Use separate PCIDs for restricted address spaces From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Each restricted address space is assigned a separate PCID. Since currently only one ASI instance per-class exists for a given process, the PCID is just derived from the class index. This commit only sets the appropriate PCID when switching CR3, but does not set the NOFLUSH bit. That will be done by later patches. Signed-off-by: Junaid Shahid --- arch/x86/include/asm/asi.h | 3 ++- arch/x86/include/asm/tlbflush.h | 3 +++ arch/x86/mm/asi.c | 6 +++-- arch/x86/mm/tlb.c | 45 ++++++++++++++++++++++++++++++--- 4 files changed, 50 insertions(+), 7 deletions(-) diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index 062ccac07fd9..aaa0d0bdbf59 100644 --- a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -40,7 +40,8 @@ struct asi { pgd_t *pgd; struct asi_class *class; struct mm_struct *mm; - int64_t asi_ref_count; + u16 pcid_index; + int64_t asi_ref_count; }; DECLARE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state); diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 3c43ad46c14a..f9ec5e67e361 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -260,6 +260,9 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch, extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch); unsigned long build_cr3(pgd_t *pgd, u16 asid); +unsigned long build_cr3_pcid(pgd_t *pgd, u16 pcid, bool noflush); + +u16 asi_pcid(struct asi *asi, u16 asid); #endif /* !MODULE */ diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index 6b9a0f5ab391..dbfea3dc4bb1 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -335,6 +335,7 @@ int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi) asi->class = &asi_class[asi_index]; asi->mm = mm; + asi->pcid_index = asi_index; if (asi->class->flags & ASI_MAP_STANDARD_NONSENSITIVE) { uint i; @@ -386,6 +387,7 @@ EXPORT_SYMBOL_GPL(asi_destroy); void __asi_enter(void) { u64 asi_cr3; + u16 pcid; struct asi *target = this_cpu_read(asi_cpu_state.target_asi); VM_BUG_ON(preemptible()); @@ -399,8 +401,8 @@ void __asi_enter(void) this_cpu_write(asi_cpu_state.curr_asi, target); - asi_cr3 = build_cr3(target->pgd, - this_cpu_read(cpu_tlbstate.loaded_mm_asid)); + pcid = asi_pcid(target, this_cpu_read(cpu_tlbstate.loaded_mm_asid)); + asi_cr3 = build_cr3_pcid(target->pgd, pcid, false); write_cr3(asi_cr3); if (target->class->ops.post_asi_enter) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 628f1cd904ac..312b9c185a55 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -97,7 +97,12 @@ # 
define PTI_CONSUMED_PCID_BITS 0 #endif -#define CR3_AVAIL_PCID_BITS (X86_CR3_PCID_BITS - PTI_CONSUMED_PCID_BITS) +#define ASI_CONSUMED_PCID_BITS ASI_MAX_NUM_ORDER +#define ASI_PCID_BITS_SHIFT CR3_AVAIL_PCID_BITS +#define CR3_AVAIL_PCID_BITS (X86_CR3_PCID_BITS - PTI_CONSUMED_PCID_BITS - \ + ASI_CONSUMED_PCID_BITS) + +static_assert(TLB_NR_DYN_ASIDS < BIT(CR3_AVAIL_PCID_BITS)); /* * ASIDs are zero-based: 0->MAX_AVAIL_ASID are valid. -1 below to account @@ -154,6 +159,34 @@ static inline u16 user_pcid(u16 asid) return ret; } +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + +u16 asi_pcid(struct asi *asi, u16 asid) +{ + return kern_pcid(asid) | (asi->pcid_index << ASI_PCID_BITS_SHIFT); +} + +#else /* CONFIG_ADDRESS_SPACE_ISOLATION */ + +u16 asi_pcid(struct asi *asi, u16 asid) +{ + return kern_pcid(asid); +} + +#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */ + +unsigned long build_cr3_pcid(pgd_t *pgd, u16 pcid, bool noflush) +{ + u64 noflush_bit = 0; + + if (!static_cpu_has(X86_FEATURE_PCID)) + pcid = 0; + else if (noflush) + noflush_bit = CR3_NOFLUSH; + + return __sme_pa(pgd) | pcid | noflush_bit; +} + inline unsigned long build_cr3(pgd_t *pgd, u16 asid) { if (static_cpu_has(X86_FEATURE_PCID)) { @@ -1078,13 +1111,17 @@ unsigned long __get_current_cr3_fast(void) pgd_t *pgd; u16 asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid); struct asi *asi = asi_get_current(); + u16 pcid; - if (asi) + if (asi) { pgd = asi_pgd(asi); - else + pcid = asi_pcid(asi, asid); + } else { pgd = this_cpu_read(cpu_tlbstate.loaded_mm)->pgd; + pcid = kern_pcid(asid); + } - cr3 = build_cr3(pgd, asid); + cr3 = build_cr3_pcid(pgd, pcid, false); /* For now, be very restrictive about when this can be called. */ VM_WARN_ON(in_nmi() || preemptible()); From patchwork Wed Feb 23 05:22:03 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756436 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 66C12C433EF for ; Wed, 23 Feb 2022 05:26:43 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238307AbiBWF1F (ORCPT ); Wed, 23 Feb 2022 00:27:05 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57626 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238285AbiBWF0T (ORCPT ); Wed, 23 Feb 2022 00:26:19 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CA6126E2BF for ; Tue, 22 Feb 2022 21:24:55 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-2d726bd83a2so91086467b3.20 for ; Tue, 22 Feb 2022 21:24:55 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=uZNDncAP+HAcrLlxbbBV0jFwf5VHCLxBgZuEl9jtIsw=; b=r7+qr6EGJVpdhQWG8PE8sNYbh0H6kTLpdH6ctOr0wPyBw/Z56OW3u31LYarb4QOnRX rF4lg3l0IVEqPlgJWaO9d3dWxAEdyJe1HcTUm1KWff1m3xuyDOy9VVB9sTaWZhhffwst nktPdcdn5pMY1JLUsndyvZTdi+apZc+M8CANZ+FleFcjiBh1KMmN3V5cV4UkdqGp6dj7 eto63Nt1iYZsRhYWTV9r7EgsApuw+VuQRiPkO/uEZytDeujtWN47qFOJnrbnHuYu8/o4 722Zq0PbXoAaWsxp0efHDuBM6HhxyWCGqP/XygpW0eZmFRIaDvse+6/DxrUVYKJYRCUo G3aA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; 
h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=uZNDncAP+HAcrLlxbbBV0jFwf5VHCLxBgZuEl9jtIsw=; b=l8NJ0z4VlL3CK0dneZUscdZq9IZ1RkpFijDSoZwfaEAFJ96EvsIkoC1eRBY7772RUf BlwvNSA+74Uj1PU5vUFvzTrDn2U2RrWl/oEWZYeeS30bnUpDloSAzY3xmPTUUqv1leW+ LknvRrHBv/QTnMUsjnMPHA4+UqcuUh7s2zqIoQTiH0nNIZLZMIHOuIYE+iYJcve5R0YN oKF3LzWaJ5nLTI0+dZphxQ/8GwFo+mlO3O4I6EnGqVq2aPwa6+DW2qfFLVjXdlsCN+YY 0eI0ystS7ZV7ot97mXMTfMmUMemXGDfpTHnz2iq1o3UQ7AljFR1kKFNehLtbXRxnCjxz nLHw== X-Gm-Message-State: AOAM532vewLKUnrP/ySXFnMRoQuq0ILriyBNe9S+JYh5DHGvVqystMSG LUiw1g7Bb7BpO2AoJbfq8inKo5rqAevr X-Google-Smtp-Source: ABdhPJyGbSHeEEC3fevrTXJ9nvduL9Idz1OuetENZ+6YGEpuvdg8wlXwAyrhsHfA5UVeantdGlz2RTX/R7Xv X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a25:6993:0:b0:624:55af:336c with SMTP id e141-20020a256993000000b0062455af336cmr19351145ybc.412.1645593885940; Tue, 22 Feb 2022 21:24:45 -0800 (PST) Date: Tue, 22 Feb 2022 21:22:03 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-28-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 27/47] mm: asi: Avoid TLB flushes during ASI CR3 switches when possible From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org The TLB flush functions are modified to flush the ASI PCIDs in addition to the unrestricted kernel PCID and the KPTI PCID. Some tracking is also added to figure out when the TLB state for ASI PCIDs is out-of-date (e.g. due to lack of INVPCID support), and ASI Enter/Exit use this information to skip a TLB flush during the CR3 switch when the TLB is already up-to-date. 
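The per-CPU bookkeeping and the CR3 switch interact roughly as in the condensed sketch below. This is a simplified restatement of the __asi_enter() changes in the diff that follows; the !PCID case, the handling of TLB-flush IPIs racing with the flag read, and the matching logic on asi_exit() are omitted here:

/*
 * Simplified sketch of the ASI-enter CR3 switch with flush avoidance.
 * Helper names match the patch; this is not the complete logic.
 */
static void example_asi_enter_switch(struct asi *target, u16 asid)
{
	bool need_flush = false;
	u16 pcid;
	u64 asi_cr3;

	if (static_cpu_has(X86_FEATURE_PCID))
		/* Consume the "this ASI PCID is stale on this CPU" flag. */
		need_flush = asi_get_and_clear_tlb_flush_pending(target);

	pcid = asi_pcid(target, asid);
	/* CR3_NOFLUSH is set only when no flush is pending for this PCID. */
	asi_cr3 = build_cr3_pcid(asi_pgd(target), pcid, !need_flush);
	write_cr3(asi_cr3);
}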
Signed-off-by: Junaid Shahid --- arch/x86/include/asm/asi.h | 11 ++- arch/x86/include/asm/tlbflush.h | 47 ++++++++++ arch/x86/mm/asi.c | 38 +++++++- arch/x86/mm/tlb.c | 152 ++++++++++++++++++++++++++++++-- 4 files changed, 234 insertions(+), 14 deletions(-) diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index aaa0d0bdbf59..1a77917c79c7 100644 --- a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -126,11 +126,18 @@ static inline void asi_intr_exit(void) if (static_cpu_has(X86_FEATURE_ASI)) { barrier(); - if (--current->thread.intr_nest_depth == 0) + if (--current->thread.intr_nest_depth == 0) { + barrier(); __asi_enter(); + } } } +static inline int asi_intr_nest_depth(void) +{ + return current->thread.intr_nest_depth; +} + #define INIT_MM_ASI(init_mm) \ .asi = { \ [0] = { \ @@ -150,6 +157,8 @@ static inline void asi_intr_enter(void) { } static inline void asi_intr_exit(void) { } +static inline int asi_intr_nest_depth(void) { return 0; } + static inline void asi_init_thread_state(struct thread_struct *thread) { } static inline pgd_t *asi_pgd(struct asi *asi) { return NULL; } diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index f9ec5e67e361..295bebdb4395 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -12,6 +12,7 @@ #include #include #include +#include void __flush_tlb_all(void); @@ -59,9 +60,20 @@ static inline void cr4_clear_bits(unsigned long mask) */ #define TLB_NR_DYN_ASIDS 6 +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + +struct asi_tlb_context { + bool flush_pending; +}; + +#endif + struct tlb_context { u64 ctx_id; u64 tlb_gen; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + struct asi_tlb_context asi_context[ASI_MAX_NUM]; +#endif }; struct tlb_state { @@ -100,6 +112,10 @@ struct tlb_state { */ bool invalidate_other; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + /* If set, ASI Exit needs to do a TLB flush during the CR3 switch */ + bool kern_pcid_needs_flush; +#endif /* * Mask that contains TLB_NR_DYN_ASIDS+1 bits to indicate * the corresponding user PCID needs a flush next time we @@ -262,8 +278,39 @@ extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch); unsigned long build_cr3(pgd_t *pgd, u16 asid); unsigned long build_cr3_pcid(pgd_t *pgd, u16 pcid, bool noflush); +u16 kern_pcid(u16 asid); u16 asi_pcid(struct asi *asi, u16 asid); +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + +static inline bool *__asi_tlb_flush_pending(struct asi *asi) +{ + struct tlb_state *tlb_state; + struct tlb_context *tlb_context; + + tlb_state = this_cpu_ptr(&cpu_tlbstate); + tlb_context = &tlb_state->ctxs[tlb_state->loaded_mm_asid]; + return &tlb_context->asi_context[asi->pcid_index].flush_pending; +} + +static inline bool asi_get_and_clear_tlb_flush_pending(struct asi *asi) +{ + bool *tlb_flush_pending_ptr = __asi_tlb_flush_pending(asi); + bool tlb_flush_pending = READ_ONCE(*tlb_flush_pending_ptr); + + if (tlb_flush_pending) + WRITE_ONCE(*tlb_flush_pending_ptr, false); + + return tlb_flush_pending; +} + +static inline void asi_clear_pending_tlb_flush(struct asi *asi) +{ + WRITE_ONCE(*__asi_tlb_flush_pending(asi), false); +} + +#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */ + #endif /* !MODULE */ #endif /* _ASM_X86_TLBFLUSH_H */ diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index dbfea3dc4bb1..17b8e6e60312 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -388,6 +388,7 @@ void __asi_enter(void) { u64 asi_cr3; u16 pcid; + bool need_flush = false; struct asi *target = 
this_cpu_read(asi_cpu_state.target_asi); VM_BUG_ON(preemptible()); @@ -401,8 +402,18 @@ void __asi_enter(void) this_cpu_write(asi_cpu_state.curr_asi, target); + if (static_cpu_has(X86_FEATURE_PCID)) + need_flush = asi_get_and_clear_tlb_flush_pending(target); + + /* + * It is possible that we may get a TLB flush IPI after + * already reading need_flush, in which case we won't do the + * flush below. However, in that case the interrupt epilog + * will also call __asi_enter(), which will do the flush. + */ + pcid = asi_pcid(target, this_cpu_read(cpu_tlbstate.loaded_mm_asid)); - asi_cr3 = build_cr3_pcid(target->pgd, pcid, false); + asi_cr3 = build_cr3_pcid(target->pgd, pcid, !need_flush); write_cr3(asi_cr3); if (target->class->ops.post_asi_enter) @@ -437,12 +448,31 @@ void asi_exit(void) asi = this_cpu_read(asi_cpu_state.curr_asi); if (asi) { + bool need_flush = false; + if (asi->class->ops.pre_asi_exit) asi->class->ops.pre_asi_exit(); - unrestricted_cr3 = - build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd, - this_cpu_read(cpu_tlbstate.loaded_mm_asid)); + if (static_cpu_has(X86_FEATURE_PCID) && + !static_cpu_has(X86_FEATURE_INVPCID_SINGLE)) { + need_flush = this_cpu_read( + cpu_tlbstate.kern_pcid_needs_flush); + this_cpu_write(cpu_tlbstate.kern_pcid_needs_flush, + false); + } + + /* + * It is possible that we may get a TLB flush IPI after + * already reading need_flush. However, in that case the IPI + * will not set flush_pending for the unrestricted address + * space, as that is done by flush_tlb_one_user() only if + * asi_intr_nest_depth() is 0. + */ + + unrestricted_cr3 = build_cr3_pcid( + this_cpu_read(cpu_tlbstate.loaded_mm)->pgd, + kern_pcid(this_cpu_read(cpu_tlbstate.loaded_mm_asid)), + !need_flush); write_cr3(unrestricted_cr3); this_cpu_write(asi_cpu_state.curr_asi, NULL); diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 312b9c185a55..5c9681df3a16 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -114,7 +114,7 @@ static_assert(TLB_NR_DYN_ASIDS < BIT(CR3_AVAIL_PCID_BITS)); /* * Given @asid, compute kPCID */ -static inline u16 kern_pcid(u16 asid) +inline u16 kern_pcid(u16 asid) { VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE); @@ -166,6 +166,60 @@ u16 asi_pcid(struct asi *asi, u16 asid) return kern_pcid(asid) | (asi->pcid_index << ASI_PCID_BITS_SHIFT); } +static void invalidate_kern_pcid(void) +{ + this_cpu_write(cpu_tlbstate.kern_pcid_needs_flush, true); +} + +static void invalidate_asi_pcid(struct asi *asi, u16 asid) +{ + uint i; + struct asi_tlb_context *asi_tlb_context; + + if (!static_cpu_has(X86_FEATURE_ASI) || + !static_cpu_has(X86_FEATURE_PCID)) + return; + + asi_tlb_context = this_cpu_ptr(cpu_tlbstate.ctxs[asid].asi_context); + + if (asi) + asi_tlb_context[asi->pcid_index].flush_pending = true; + else + for (i = 1; i < ASI_MAX_NUM; i++) + asi_tlb_context[i].flush_pending = true; +} + +static void flush_asi_pcid(struct asi *asi) +{ + u16 asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid); + /* + * The flag should be cleared before the INVPCID, to avoid clearing it + * in case an interrupt/exception sets it again after the INVPCID. 
+ */ + asi_clear_pending_tlb_flush(asi); + invpcid_flush_single_context(asi_pcid(asi, asid)); +} + +static void __flush_tlb_one_asi(struct asi *asi, u16 asid, size_t addr) +{ + if (!static_cpu_has(X86_FEATURE_ASI)) + return; + + if (!static_cpu_has(X86_FEATURE_INVPCID_SINGLE)) { + invalidate_asi_pcid(asi, asid); + } else if (asi) { + invpcid_flush_one(asi_pcid(asi, asid), addr); + } else { + uint i; + struct mm_struct *mm = this_cpu_read(cpu_tlbstate.loaded_mm); + + for (i = 1; i < ASI_MAX_NUM; i++) + if (mm->asi[i].pgd) + invpcid_flush_one(asi_pcid(&mm->asi[i], asid), + addr); + } +} + #else /* CONFIG_ADDRESS_SPACE_ISOLATION */ u16 asi_pcid(struct asi *asi, u16 asid) @@ -173,6 +227,11 @@ u16 asi_pcid(struct asi *asi, u16 asid) return kern_pcid(asid); } +static inline void invalidate_kern_pcid(void) { } +static inline void invalidate_asi_pcid(struct asi *asi, u16 asid) { } +static inline void flush_asi_pcid(struct asi *asi) { } +static inline void __flush_tlb_one_asi(struct asi *asi, u16 asid, size_t addr) { } + #endif /* CONFIG_ADDRESS_SPACE_ISOLATION */ unsigned long build_cr3_pcid(pgd_t *pgd, u16 pcid, bool noflush) @@ -223,7 +282,8 @@ static void clear_asid_other(void) * This is only expected to be set if we have disabled * kernel _PAGE_GLOBAL pages. */ - if (!static_cpu_has(X86_FEATURE_PTI)) { + if (!static_cpu_has(X86_FEATURE_PTI) && + !cpu_feature_enabled(X86_FEATURE_ASI)) { WARN_ON_ONCE(1); return; } @@ -313,6 +373,7 @@ static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush) if (need_flush) { invalidate_user_asid(new_asid); + invalidate_asi_pcid(NULL, new_asid); new_mm_cr3 = build_cr3(pgdir, new_asid); } else { new_mm_cr3 = build_cr3_noflush(pgdir, new_asid); @@ -741,11 +802,17 @@ void initialize_tlbstate_and_flush(void) this_cpu_write(cpu_tlbstate.next_asid, 1); this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id); this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, tlb_gen); + invalidate_asi_pcid(NULL, 0); for (i = 1; i < TLB_NR_DYN_ASIDS; i++) this_cpu_write(cpu_tlbstate.ctxs[i].ctx_id, 0); } +static inline void invlpg(unsigned long addr) +{ + asm volatile("invlpg (%0)" ::"r"(addr) : "memory"); +} + /* * flush_tlb_func()'s memory ordering requirement is that any * TLB fills that happen after we flush the TLB are ordered after we @@ -967,7 +1034,8 @@ void flush_tlb_multi(const struct cpumask *cpumask, * least 95%) of allocations, and is small enough that we are * confident it will not cause too much overhead. Each single * flush is about 100 ns, so this caps the maximum overhead at - * _about_ 3,000 ns. + * _about_ 3,000 ns (plus upto an additional ~3000 ns for each + * ASI instance, or for KPTI). * * This is in units of pages. 
*/ @@ -1157,7 +1225,8 @@ void flush_tlb_one_kernel(unsigned long addr) */ flush_tlb_one_user(addr); - if (!static_cpu_has(X86_FEATURE_PTI)) + if (!static_cpu_has(X86_FEATURE_PTI) && + !cpu_feature_enabled(X86_FEATURE_ASI)) return; /* @@ -1174,9 +1243,45 @@ void flush_tlb_one_kernel(unsigned long addr) */ STATIC_NOPV void native_flush_tlb_one_user(unsigned long addr) { - u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid); + u16 loaded_mm_asid; - asm volatile("invlpg (%0)" ::"r" (addr) : "memory"); + if (!static_cpu_has(X86_FEATURE_PCID)) { + invlpg(addr); + return; + } + + loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid); + + /* + * If we don't have INVPCID support, then we do an ASI Exit so that + * the invlpg happens in the unrestricted address space, and we + * invalidate the ASI PCID so that it is flushed at the next ASI Enter. + * + * But if a valid target ASI is set, then an ASI Exit can be ephemeral + * due to interrupts/exceptions/NMIs (except if we are already inside + * one), so we just invalidate both the ASI and the unrestricted kernel + * PCIDs and let the invlpg flush whichever happens to be the current + * address space. This is a bit more wasteful, but this scenario is not + * actually expected to occur with the current usage of ASI, and is + * handled here just for completeness. (If we wanted to optimize this, + * we could manipulate the intr_nest_depth to guarantee that an ASI + * Exit is not ephemeral). + */ + if (!static_cpu_has(X86_FEATURE_INVPCID_SINGLE)) { + if (unlikely(!asi_is_target_unrestricted()) && + asi_intr_nest_depth() == 0) + invalidate_kern_pcid(); + else + asi_exit(); + } + + /* Flush the unrestricted kernel address space */ + if (!is_asi_active()) + invlpg(addr); + else + invpcid_flush_one(kern_pcid(loaded_mm_asid), addr); + + __flush_tlb_one_asi(NULL, loaded_mm_asid, addr); if (!static_cpu_has(X86_FEATURE_PTI)) return; @@ -1235,6 +1340,9 @@ STATIC_NOPV void native_flush_tlb_global(void) */ STATIC_NOPV void native_flush_tlb_local(void) { + struct asi *asi; + u16 asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid); + /* * Preemption or interrupts must be disabled to protect the access * to the per CPU variable and to prevent being preempted between @@ -1242,10 +1350,36 @@ STATIC_NOPV void native_flush_tlb_local(void) */ WARN_ON_ONCE(preemptible()); - invalidate_user_asid(this_cpu_read(cpu_tlbstate.loaded_mm_asid)); + /* + * If we don't have INVPCID support, then we have to use + * write_cr3(read_cr3()). However, that is not safe when ASI is active, + * as an interrupt/exception/NMI could cause an ASI Exit in the middle + * and change CR3. So we trigger an ASI Exit beforehand. But if a valid + * target ASI is set, then an ASI Exit can also be ephemeral due to + * interrupts (except if we are already inside one), and thus we have to + * fallback to a global TLB flush. 
+ */ + if (!static_cpu_has(X86_FEATURE_INVPCID_SINGLE)) { + if (unlikely(!asi_is_target_unrestricted()) && + asi_intr_nest_depth() == 0) { + native_flush_tlb_global(); + return; + } + asi_exit(); + } - /* If current->mm == NULL then the read_cr3() "borrows" an mm */ - native_write_cr3(__native_read_cr3()); + invalidate_user_asid(asid); + invalidate_asi_pcid(NULL, asid); + + asi = asi_get_current(); + + if (!asi) { + /* If current->mm == NULL then the read_cr3() "borrows" an mm */ + native_write_cr3(__native_read_cr3()); + } else { + invpcid_flush_single_context(kern_pcid(asid)); + flush_asi_pcid(asi); + } } void flush_tlb_local(void) From patchwork Wed Feb 23 05:22:04 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756435 Date: Tue, 22 Feb 2022 21:22:04 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id:
<20220223052223.1202152-29-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 28/47] mm: asi: Avoid TLB flush IPIs to CPUs not in ASI context From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Most CPUs will not be running in a restricted ASI address space at any given time. So when we need to do an ASI TLB flush, we can skip those CPUs and let them do a flush at the time of the next ASI Enter. Furthermore, for flushes related to local non-sensitive memory, we can restrict the CPU set even further to those CPUs that have that specific mm_struct loaded. Signed-off-by: Junaid Shahid --- arch/x86/include/asm/asi.h | 9 +- arch/x86/include/asm/tlbflush.h | 47 +++---- arch/x86/mm/asi.c | 73 +++++++++-- arch/x86/mm/tlb.c | 209 ++++++++++++++++++++++++++++++-- 4 files changed, 282 insertions(+), 56 deletions(-) diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index 1a77917c79c7..35421356584b 100644 --- a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -41,6 +41,8 @@ struct asi { struct asi_class *class; struct mm_struct *mm; u16 pcid_index; + atomic64_t *tlb_gen; + atomic64_t __tlb_gen; int64_t asi_ref_count; }; @@ -138,11 +140,16 @@ static inline int asi_intr_nest_depth(void) return current->thread.intr_nest_depth; } +void asi_get_latest_tlb_gens(struct asi *asi, u64 *latest_local_tlb_gen, + u64 *latest_global_tlb_gen); + #define INIT_MM_ASI(init_mm) \ .asi = { \ [0] = { \ .pgd = asi_global_nonsensitive_pgd, \ - .mm = &init_mm \ + .mm = &init_mm, \ + .__tlb_gen = ATOMIC64_INIT(1), \ + .tlb_gen = &init_mm.asi[0].__tlb_gen \ } \ }, diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 295bebdb4395..85315d1d2d70 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -63,7 +63,8 @@ static inline void cr4_clear_bits(unsigned long mask) #ifdef CONFIG_ADDRESS_SPACE_ISOLATION struct asi_tlb_context { - bool flush_pending; + u64 local_tlb_gen; + u64 global_tlb_gen; }; #endif @@ -223,6 +224,20 @@ struct flush_tlb_info { unsigned int initiating_cpu; u8 stride_shift; u8 freed_tables; + +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + /* + * We can't use the mm pointer above, as there can be some cases where + * the mm is already freed. Of course, a flush wouldn't be necessary + * in that case, and we would know that when we compare the context ID. + * + * If U64_MAX, then a global flush would be done. + */ + u64 mm_context_id; + + /* If non-zero, flush only the ASI instance with this PCID index. 
*/ + u16 asi_pcid_index; +#endif }; void flush_tlb_local(void); @@ -281,36 +296,6 @@ unsigned long build_cr3_pcid(pgd_t *pgd, u16 pcid, bool noflush); u16 kern_pcid(u16 asid); u16 asi_pcid(struct asi *asi, u16 asid); -#ifdef CONFIG_ADDRESS_SPACE_ISOLATION - -static inline bool *__asi_tlb_flush_pending(struct asi *asi) -{ - struct tlb_state *tlb_state; - struct tlb_context *tlb_context; - - tlb_state = this_cpu_ptr(&cpu_tlbstate); - tlb_context = &tlb_state->ctxs[tlb_state->loaded_mm_asid]; - return &tlb_context->asi_context[asi->pcid_index].flush_pending; -} - -static inline bool asi_get_and_clear_tlb_flush_pending(struct asi *asi) -{ - bool *tlb_flush_pending_ptr = __asi_tlb_flush_pending(asi); - bool tlb_flush_pending = READ_ONCE(*tlb_flush_pending_ptr); - - if (tlb_flush_pending) - WRITE_ONCE(*tlb_flush_pending_ptr, false); - - return tlb_flush_pending; -} - -static inline void asi_clear_pending_tlb_flush(struct asi *asi) -{ - WRITE_ONCE(*__asi_tlb_flush_pending(asi), false); -} - -#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */ - #endif /* !MODULE */ #endif /* _ASM_X86_TLBFLUSH_H */ diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index 17b8e6e60312..29c74b6d4262 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -355,6 +355,11 @@ int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi) for (i = pgd_index(VMALLOC_GLOBAL_NONSENSITIVE_START); i < PTRS_PER_PGD; i++) set_pgd(asi->pgd + i, asi_global_nonsensitive_pgd[i]); + + asi->tlb_gen = &mm->asi[0].__tlb_gen; + } else { + asi->tlb_gen = &asi->__tlb_gen; + atomic64_set(asi->tlb_gen, 1); } exit_unlock: @@ -384,11 +389,26 @@ void asi_destroy(struct asi *asi) } EXPORT_SYMBOL_GPL(asi_destroy); +void asi_get_latest_tlb_gens(struct asi *asi, u64 *latest_local_tlb_gen, + u64 *latest_global_tlb_gen) +{ + if (likely(asi->class->flags & ASI_MAP_STANDARD_NONSENSITIVE)) + *latest_global_tlb_gen = + atomic64_read(ASI_GLOBAL_NONSENSITIVE->tlb_gen); + else + *latest_global_tlb_gen = 0; + + *latest_local_tlb_gen = atomic64_read(asi->tlb_gen); +} + void __asi_enter(void) { u64 asi_cr3; u16 pcid; bool need_flush = false; + u64 latest_local_tlb_gen, latest_global_tlb_gen; + struct tlb_state *tlb_state; + struct asi_tlb_context *tlb_context; struct asi *target = this_cpu_read(asi_cpu_state.target_asi); VM_BUG_ON(preemptible()); @@ -397,17 +417,35 @@ void __asi_enter(void) if (!target || target == this_cpu_read(asi_cpu_state.curr_asi)) return; - VM_BUG_ON(this_cpu_read(cpu_tlbstate.loaded_mm) == - LOADED_MM_SWITCHING); + tlb_state = this_cpu_ptr(&cpu_tlbstate); + VM_BUG_ON(tlb_state->loaded_mm == LOADED_MM_SWITCHING); this_cpu_write(asi_cpu_state.curr_asi, target); - if (static_cpu_has(X86_FEATURE_PCID)) - need_flush = asi_get_and_clear_tlb_flush_pending(target); + if (static_cpu_has(X86_FEATURE_PCID)) { + /* + * curr_asi write has to happen before the asi->tlb_gen reads + * below. + * + * See comments in asi_flush_tlb_range(). + */ + smp_mb(); + + asi_get_latest_tlb_gens(target, &latest_local_tlb_gen, + &latest_global_tlb_gen); + + tlb_context = &tlb_state->ctxs[tlb_state->loaded_mm_asid] + .asi_context[target->pcid_index]; + + if (READ_ONCE(tlb_context->local_tlb_gen) < latest_local_tlb_gen + || READ_ONCE(tlb_context->global_tlb_gen) < + latest_global_tlb_gen) + need_flush = true; + } /* * It is possible that we may get a TLB flush IPI after - * already reading need_flush, in which case we won't do the + * already calculating need_flush, in which case we won't do the * flush below. 
However, in that case the interrupt epilog * will also call __asi_enter(), which will do the flush. */ @@ -416,6 +454,23 @@ void __asi_enter(void) asi_cr3 = build_cr3_pcid(target->pgd, pcid, !need_flush); write_cr3(asi_cr3); + if (static_cpu_has(X86_FEATURE_PCID)) { + /* + * There is a small possibility that an interrupt happened + * after the read of the latest_*_tlb_gen above and when + * that interrupt did an asi_enter() upon return, it read + * an even higher latest_*_tlb_gen and already updated the + * tlb_context->*tlb_gen accordingly. In that case, the + * following will move back the tlb_context->*tlb_gen. That + * isn't ideal, but it should not cause any correctness issues. + * We may just end up doing an unnecessary TLB flush on the next + * asi_enter(). If we really needed to avoid that, we could + * just do a cmpxchg, but it is likely not necessary. + */ + WRITE_ONCE(tlb_context->local_tlb_gen, latest_local_tlb_gen); + WRITE_ONCE(tlb_context->global_tlb_gen, latest_global_tlb_gen); + } + if (target->class->ops.post_asi_enter) target->class->ops.post_asi_enter(); } @@ -504,6 +559,8 @@ int asi_init_mm_state(struct mm_struct *mm) if (!mm->asi_enabled) return 0; + mm->asi[0].tlb_gen = &mm->asi[0].__tlb_gen; + atomic64_set(mm->asi[0].tlb_gen, 1); mm->asi[0].mm = mm; mm->asi[0].pgd = (pgd_t *)__get_free_page(GFP_PGTABLE_USER); if (!mm->asi[0].pgd) @@ -718,12 +775,6 @@ void asi_unmap(struct asi *asi, void *addr, size_t len, bool flush_tlb) asi_flush_tlb_range(asi, addr, len); } -void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len) -{ - /* Later patches will do a more optimized flush. */ - flush_tlb_kernel_range((ulong)addr, (ulong)addr + len); -} - void *asi_va(unsigned long pa) { struct page *page = pfn_to_page(PHYS_PFN(pa)); diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 5c9681df3a16..2a442335501f 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -31,6 +31,8 @@ # define __flush_tlb_multi(msk, info) native_flush_tlb_multi(msk, info) #endif +STATIC_NOPV void native_flush_tlb_global(void); + /* * TLB flushing, formerly SMP-only * c/o Linus Torvalds. @@ -173,7 +175,6 @@ static void invalidate_kern_pcid(void) static void invalidate_asi_pcid(struct asi *asi, u16 asid) { - uint i; struct asi_tlb_context *asi_tlb_context; if (!static_cpu_has(X86_FEATURE_ASI) || @@ -183,21 +184,30 @@ static void invalidate_asi_pcid(struct asi *asi, u16 asid) asi_tlb_context = this_cpu_ptr(cpu_tlbstate.ctxs[asid].asi_context); if (asi) - asi_tlb_context[asi->pcid_index].flush_pending = true; + asi_tlb_context[asi->pcid_index] = + (struct asi_tlb_context) { 0 }; else - for (i = 1; i < ASI_MAX_NUM; i++) - asi_tlb_context[i].flush_pending = true; + memset(asi_tlb_context, 0, + sizeof(struct asi_tlb_context) * ASI_MAX_NUM); } static void flush_asi_pcid(struct asi *asi) { u16 asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid); - /* - * The flag should be cleared before the INVPCID, to avoid clearing it - * in case an interrupt/exception sets it again after the INVPCID. - */ - asi_clear_pending_tlb_flush(asi); + struct asi_tlb_context *tlb_context = this_cpu_ptr( + &cpu_tlbstate.ctxs[asid].asi_context[asi->pcid_index]); + u64 latest_local_tlb_gen = atomic64_read(asi->tlb_gen); + u64 latest_global_tlb_gen = atomic64_read( + ASI_GLOBAL_NONSENSITIVE->tlb_gen); + invpcid_flush_single_context(asi_pcid(asi, asid)); + + /* + * This could sometimes move the *_tlb_gen backwards. See comments + * in __asi_enter(). 
+ */ + WRITE_ONCE(tlb_context->local_tlb_gen, latest_local_tlb_gen); + WRITE_ONCE(tlb_context->global_tlb_gen, latest_global_tlb_gen); } static void __flush_tlb_one_asi(struct asi *asi, u16 asid, size_t addr) @@ -1050,7 +1060,7 @@ static DEFINE_PER_CPU(unsigned int, flush_tlb_info_idx); static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm, unsigned long start, unsigned long end, unsigned int stride_shift, bool freed_tables, - u64 new_tlb_gen) + u64 new_tlb_gen, u64 mm_ctx_id, u16 asi_pcid_index) { struct flush_tlb_info *info = this_cpu_ptr(&flush_tlb_info); @@ -1071,6 +1081,11 @@ static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm, info->new_tlb_gen = new_tlb_gen; info->initiating_cpu = smp_processor_id(); +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + info->mm_context_id = mm_ctx_id; + info->asi_pcid_index = asi_pcid_index; +#endif + return info; } @@ -1104,7 +1119,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, new_tlb_gen = inc_mm_tlb_gen(mm); info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables, - new_tlb_gen); + new_tlb_gen, 0, 0); /* * flush_tlb_multi() is not optimized for the common case in which only @@ -1157,7 +1172,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end) struct flush_tlb_info *info; preempt_disable(); - info = get_flush_tlb_info(NULL, start, end, 0, false, 0); + info = get_flush_tlb_info(NULL, start, end, 0, false, 0, 0, 0); on_each_cpu(do_kernel_range_flush, info, 1); @@ -1166,6 +1181,174 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end) } } +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + +static inline void invlpg_range(size_t start, size_t end, size_t stride) +{ + size_t addr; + + for (addr = start; addr < end; addr += stride) + invlpg(addr); +} + +static bool asi_needs_tlb_flush(struct asi *asi, struct flush_tlb_info *info) +{ + if (!asi || + (info->mm_context_id != U64_MAX && + info->mm_context_id != asi->mm->context.ctx_id) || + (info->asi_pcid_index && info->asi_pcid_index != asi->pcid_index)) + return false; + + if (unlikely(!(asi->class->flags & ASI_MAP_STANDARD_NONSENSITIVE)) && + (info->mm_context_id == U64_MAX || !info->asi_pcid_index)) + return false; + + return true; +} + +static void __flush_asi_tlb_all(struct asi *asi) +{ + if (static_cpu_has(X86_FEATURE_INVPCID_SINGLE)) { + flush_asi_pcid(asi); + return; + } + + /* See comments in native_flush_tlb_local() */ + if (unlikely(!asi_is_target_unrestricted()) && + asi_intr_nest_depth() == 0) { + native_flush_tlb_global(); + return; + } + + /* Let the next ASI Enter do the flush */ + asi_exit(); +} + +static void do_asi_tlb_flush(void *data) +{ + struct flush_tlb_info *info = data; + struct tlb_state *tlb_state = this_cpu_ptr(&cpu_tlbstate); + struct asi_tlb_context *tlb_context; + struct asi *asi = asi_get_current(); + u64 latest_local_tlb_gen, latest_global_tlb_gen; + u64 curr_local_tlb_gen, curr_global_tlb_gen; + u64 new_local_tlb_gen, new_global_tlb_gen; + bool do_flush_all; + + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED); + + if (!asi_needs_tlb_flush(asi, info)) + return; + + do_flush_all = info->end - info->start > + (tlb_single_page_flush_ceiling << PAGE_SHIFT); + + if (!static_cpu_has(X86_FEATURE_PCID)) { + if (do_flush_all) + __flush_asi_tlb_all(asi); + else + invlpg_range(info->start, info->end, PAGE_SIZE); + return; + } + + tlb_context = &tlb_state->ctxs[tlb_state->loaded_mm_asid] + .asi_context[asi->pcid_index]; + + asi_get_latest_tlb_gens(asi, &latest_local_tlb_gen, + 
&latest_global_tlb_gen); + + curr_local_tlb_gen = READ_ONCE(tlb_context->local_tlb_gen); + curr_global_tlb_gen = READ_ONCE(tlb_context->global_tlb_gen); + + if (info->mm_context_id == U64_MAX) { + new_global_tlb_gen = info->new_tlb_gen; + new_local_tlb_gen = curr_local_tlb_gen; + } else { + new_local_tlb_gen = info->new_tlb_gen; + new_global_tlb_gen = curr_global_tlb_gen; + } + + /* Somebody already did a full flush */ + if (new_local_tlb_gen <= curr_local_tlb_gen && + new_global_tlb_gen <= curr_global_tlb_gen) + return; + + /* + * If we can't bring the TLB up-to-date with a range flush, then do a + * full flush anyway. + */ + if (do_flush_all || !(new_local_tlb_gen == latest_local_tlb_gen && + new_global_tlb_gen == latest_global_tlb_gen && + new_local_tlb_gen <= curr_local_tlb_gen + 1 && + new_global_tlb_gen <= curr_global_tlb_gen + 1)) { + __flush_asi_tlb_all(asi); + return; + } + + invlpg_range(info->start, info->end, PAGE_SIZE); + + /* + * If we are still in ASI context, then all the INVLPGs flushed the + * ASI PCID and so we can update the tlb_gens. + */ + if (asi_get_current() == asi) { + WRITE_ONCE(tlb_context->local_tlb_gen, new_local_tlb_gen); + WRITE_ONCE(tlb_context->global_tlb_gen, new_global_tlb_gen); + } +} + +static bool is_asi_active_on_cpu(int cpu, void *info) +{ + return per_cpu(asi_cpu_state.curr_asi, cpu); +} + +void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len) +{ + size_t start = (size_t)addr; + size_t end = start + len; + struct flush_tlb_info *info; + u64 mm_context_id; + const cpumask_t *cpu_mask; + u64 new_tlb_gen = 0; + + if (!static_cpu_has(X86_FEATURE_ASI)) + return; + + if (static_cpu_has(X86_FEATURE_PCID)) { + new_tlb_gen = atomic64_inc_return(asi->tlb_gen); + + /* + * The increment of tlb_gen must happen before the curr_asi + * reads in is_asi_active_on_cpu(). That ensures that if another + * CPU is in asi_enter() and happens to write to curr_asi after + * is_asi_active_on_cpu() read it, it will see the updated + * tlb_gen and perform a flush during the TLB switch. + */ + smp_mb__after_atomic(); + } + + preempt_disable(); + + if (asi == ASI_GLOBAL_NONSENSITIVE) { + mm_context_id = U64_MAX; + cpu_mask = cpu_online_mask; + } else { + mm_context_id = asi->mm->context.ctx_id; + cpu_mask = mm_cpumask(asi->mm); + } + + info = get_flush_tlb_info(NULL, start, end, 0, false, new_tlb_gen, + mm_context_id, asi->pcid_index); + + on_each_cpu_cond_mask(is_asi_active_on_cpu, do_asi_tlb_flush, info, + true, cpu_mask); + + put_flush_tlb_info(); + preempt_enable(); +} + +#endif + /* * This can be used from process context to figure out what the value of * CR3 is without needing to do a (slow) __read_cr3(). @@ -1415,7 +1598,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) int cpu = get_cpu(); - info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false, 0); + info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false, 0, 0, 0); /* * flush_tlb_multi() is not optimized for the common case in which only * a local TLB flush is needed. 
Optimize this use-case by calling From patchwork Wed Feb 23 05:22:05 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756434 Date: Tue, 22 Feb 2022 21:22:05 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-30-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 29/47] mm: asi: Reduce TLB flushes when freeing pages asynchronously From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID:
X-Mailing-List: kvm@vger.kernel.org When we are freeing pages asynchronously (because the original free was issued with IRQs disabled), issue only one TLB flush per execution of the async work function. If there is only one page to free, we do a targeted flush for that page only. Otherwise, we just do a full flush. Signed-off-by: Junaid Shahid --- arch/x86/include/asm/tlbflush.h | 8 +++++ arch/x86/mm/tlb.c | 52 ++++++++++++++++++++------------- include/linux/mm_types.h | 30 +++++++++++++------ mm/page_alloc.c | 40 ++++++++++++++++++++----- 4 files changed, 93 insertions(+), 37 deletions(-) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 85315d1d2d70..7d04aa2a5f86 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -296,6 +296,14 @@ unsigned long build_cr3_pcid(pgd_t *pgd, u16 pcid, bool noflush); u16 kern_pcid(u16 asid); u16 asi_pcid(struct asi *asi, u16 asid); +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + +void __asi_prepare_tlb_flush(struct asi *asi, u64 *new_tlb_gen); +void __asi_flush_tlb_range(u64 mm_context_id, u16 pcid_index, u64 new_tlb_gen, + size_t start, size_t end, const cpumask_t *cpu_mask); + +#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */ + #endif /* !MODULE */ #endif /* _ASM_X86_TLBFLUSH_H */ diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 2a442335501f..fcd2c8e92f83 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -1302,21 +1302,10 @@ static bool is_asi_active_on_cpu(int cpu, void *info) return per_cpu(asi_cpu_state.curr_asi, cpu); } -void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len) +void __asi_prepare_tlb_flush(struct asi *asi, u64 *new_tlb_gen) { - size_t start = (size_t)addr; - size_t end = start + len; - struct flush_tlb_info *info; - u64 mm_context_id; - const cpumask_t *cpu_mask; - u64 new_tlb_gen = 0; - - if (!static_cpu_has(X86_FEATURE_ASI)) - return; - if (static_cpu_has(X86_FEATURE_PCID)) { - new_tlb_gen = atomic64_inc_return(asi->tlb_gen); - + *new_tlb_gen = atomic64_inc_return(asi->tlb_gen); /* * The increment of tlb_gen must happen before the curr_asi * reads in is_asi_active_on_cpu(). 
That ensures that if another @@ -1326,8 +1315,35 @@ void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len) */ smp_mb__after_atomic(); } +} + +void __asi_flush_tlb_range(u64 mm_context_id, u16 pcid_index, u64 new_tlb_gen, + size_t start, size_t end, const cpumask_t *cpu_mask) +{ + struct flush_tlb_info *info; preempt_disable(); + info = get_flush_tlb_info(NULL, start, end, 0, false, new_tlb_gen, + mm_context_id, pcid_index); + + on_each_cpu_cond_mask(is_asi_active_on_cpu, do_asi_tlb_flush, info, + true, cpu_mask); + put_flush_tlb_info(); + preempt_enable(); +} + +void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len) +{ + size_t start = (size_t)addr; + size_t end = start + len; + u64 mm_context_id; + u64 new_tlb_gen = 0; + const cpumask_t *cpu_mask; + + if (!static_cpu_has(X86_FEATURE_ASI)) + return; + + __asi_prepare_tlb_flush(asi, &new_tlb_gen); if (asi == ASI_GLOBAL_NONSENSITIVE) { mm_context_id = U64_MAX; @@ -1337,14 +1353,8 @@ void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len) cpu_mask = mm_cpumask(asi->mm); } - info = get_flush_tlb_info(NULL, start, end, 0, false, new_tlb_gen, - mm_context_id, asi->pcid_index); - - on_each_cpu_cond_mask(is_asi_active_on_cpu, do_asi_tlb_flush, info, - true, cpu_mask); - - put_flush_tlb_info(); - preempt_enable(); + __asi_flush_tlb_range(mm_context_id, asi->pcid_index, new_tlb_gen, + start, end, cpu_mask); } #endif diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 56511adc263e..7d38229ca85c 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -193,21 +193,33 @@ struct page { /** @rcu_head: You can use this to free a page by RCU. */ struct rcu_head rcu_head; -#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +#if defined(CONFIG_ADDRESS_SPACE_ISOLATION) && !defined(BUILD_VDSO32) struct { /* Links the pages_to_free_async list */ struct llist_node async_free_node; unsigned long _asi_pad_1; - unsigned long _asi_pad_2; + u64 asi_tlb_gen; - /* - * Upon allocation of a locally non-sensitive page, set - * to the allocating mm. Must be set to the same mm when - * the page is freed. May potentially be overwritten in - * the meantime, as long as it is restored before free. - */ - struct mm_struct *asi_mm; + union { + /* + * Upon allocation of a locally non-sensitive + * page, set to the allocating mm. Must be set + * to the same mm when the page is freed. May + * potentially be overwritten in the meantime, + * as long as it is restored before free. + */ + struct mm_struct *asi_mm; + + /* + * Set to the above mm's context ID if the page + * is being freed asynchronously. Can't directly + * use the mm_struct, unless we take additional + * steps to avoid it from being freed while the + * async work is pending. + */ + u64 asi_mm_ctx_id; + }; }; #endif }; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 01784bff2a80..998ff6a56732 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5182,20 +5182,41 @@ static void async_free_work_fn(struct work_struct *work) { struct page *page, *tmp; struct llist_node *pages_to_free; - void *va; - size_t len; + size_t addr; uint order; pages_to_free = llist_del_all(this_cpu_ptr(&pages_to_free_async)); - /* A later patch will do a more optimized TLB flush. */ + if (!pages_to_free) + return; + + /* If we only have one page to free, then do a targeted TLB flush. 
 */ + if (!llist_next(pages_to_free)) { + page = llist_entry(pages_to_free, struct page, async_free_node); + addr = (size_t)page_to_virt(page); + order = page->private; + + __asi_flush_tlb_range(page->asi_mm_ctx_id, 0, page->asi_tlb_gen, + addr, addr + PAGE_SIZE * (1 << order), + cpu_online_mask); + /* Need to clear, since it shares space with page->mapping. */ + page->asi_tlb_gen = 0; + + __free_the_page(page, order); + return; + } + + /* + * Otherwise, do a full flush. We could potentially try to optimize it + * via taking a union of what needs to be flushed, but it may not be + * worth the additional complexity. + */ + asi_flush_tlb_range(ASI_GLOBAL_NONSENSITIVE, 0, TLB_FLUSH_ALL); llist_for_each_entry_safe(page, tmp, pages_to_free, async_free_node) { - va = page_to_virt(page); order = page->private; - len = PAGE_SIZE * (1 << order); - - asi_flush_tlb_range(ASI_GLOBAL_NONSENSITIVE, va, len); + /* Need to clear, since it shares space with page->mapping. */ + page->asi_tlb_gen = 0; __free_the_page(page, order); } } @@ -5291,6 +5312,11 @@ static bool asi_unmap_freed_pages(struct page *page, unsigned int order) if (!async_flush_needed) return true; + page->asi_mm_ctx_id = PageGlobalNonSensitive(page) + ? U64_MAX : asi->mm->context.ctx_id; + + __asi_prepare_tlb_flush(asi, &page->asi_tlb_gen); + page->private = order; llist_add(&page->async_free_node, this_cpu_ptr(&pages_to_free_async)); From patchwork Wed Feb 23 05:22:06 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756433
Date: Tue, 22 Feb 2022 21:22:06 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-31-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 30/47] mm: asi: Add API for mapping userspace address ranges From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org asi_map_user()/asi_unmap_user() can be used to map userspace address ranges for ASI classes that do not specify ASI_MAP_ALL_USERSPACE. In addition, another structure, asi_pgtbl_pool, allows for pre-allocating a set of pages to avoid having to allocate memory for page tables within asi_map_user(), which makes it easier to use that function while holding locks.
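As a usage illustration, a hypothetical caller that needs to map a user range while holding a spinlock could pre-fill a pool before taking the lock and then map under it. The example_map_user_range() helper below, its lock, and the pool size of 4 are invented for this sketch and do not appear in the patch; only the asi_* calls come from the API added here.

/*
 * Hypothetical caller of the API added by this patch. The function name,
 * the lock, and the pool size are illustrative only.
 */
#include <linux/gfp.h>
#include <linux/spinlock.h>
#include <asm/asi.h>

static int example_map_user_range(struct asi *asi, void *uaddr, size_t len,
				  spinlock_t *lock)
{
	struct asi_pgtbl_pool pool;
	int err;

	asi_init_pgtbl_pool(&pool);

	/* Allocate page tables up front, while sleeping is still allowed. */
	err = asi_fill_pgtbl_pool(&pool, 4, GFP_KERNEL);
	if (err)
		return err;

	spin_lock(lock);
	/*
	 * asi_map_user() draws any page tables it needs from the pool, so it
	 * does not have to allocate memory while the lock is held.
	 */
	err = asi_map_user(asi, uaddr, len, &pool,
			   (size_t)uaddr, (size_t)uaddr + len);
	spin_unlock(lock);

	/* Return the unused pre-allocated page tables. */
	asi_clear_pgtbl_pool(&pool);
	return err;
}

The pool only matters when new page-table pages are actually needed; ranges whose intermediate tables already exist in the ASI page tables map without drawing from it.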
Signed-off-by: Junaid Shahid --- arch/x86/include/asm/asi.h | 19 +++ arch/x86/mm/asi.c | 252 ++++++++++++++++++++++++++++++++++--- include/asm-generic/asi.h | 21 ++++ include/linux/mm_types.h | 2 +- 4 files changed, 275 insertions(+), 19 deletions(-) diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index 35421356584b..bdb2f70d4f85 100644 --- a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -44,6 +44,12 @@ struct asi { atomic64_t *tlb_gen; atomic64_t __tlb_gen; int64_t asi_ref_count; + rwlock_t user_map_lock; +}; + +struct asi_pgtbl_pool { + struct page *pgtbl_list; + uint count; }; DECLARE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state); @@ -74,6 +80,19 @@ void asi_do_lazy_map(struct asi *asi, size_t addr); void asi_clear_user_pgd(struct mm_struct *mm, size_t addr); void asi_clear_user_p4d(struct mm_struct *mm, size_t addr); +int asi_map_user(struct asi *asi, void *addr, size_t len, + struct asi_pgtbl_pool *pool, + size_t allowed_start, size_t allowed_end); +void asi_unmap_user(struct asi *asi, void *va, size_t len); +int asi_fill_pgtbl_pool(struct asi_pgtbl_pool *pool, uint count, gfp_t flags); +void asi_clear_pgtbl_pool(struct asi_pgtbl_pool *pool); + +static inline void asi_init_pgtbl_pool(struct asi_pgtbl_pool *pool) +{ + pool->pgtbl_list = NULL; + pool->count = 0; +} + static inline void asi_init_thread_state(struct thread_struct *thread) { thread->intr_nest_depth = 0; diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index 29c74b6d4262..9b1bd005f343 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -86,6 +86,55 @@ void asi_unregister_class(int index) } EXPORT_SYMBOL_GPL(asi_unregister_class); +static ulong get_pgtbl_from_pool(struct asi_pgtbl_pool *pool) +{ + struct page *pgtbl; + + if (pool->count == 0) + return 0; + + pgtbl = pool->pgtbl_list; + pool->pgtbl_list = pgtbl->asi_pgtbl_pool_next; + pgtbl->asi_pgtbl_pool_next = NULL; + pool->count--; + + return (ulong)page_address(pgtbl); +} + +static void return_pgtbl_to_pool(struct asi_pgtbl_pool *pool, ulong virt) +{ + struct page *pgtbl = virt_to_page(virt); + + pgtbl->asi_pgtbl_pool_next = pool->pgtbl_list; + pool->pgtbl_list = pgtbl; + pool->count++; +} + +int asi_fill_pgtbl_pool(struct asi_pgtbl_pool *pool, uint count, gfp_t flags) +{ + if (!static_cpu_has(X86_FEATURE_ASI)) + return 0; + + while (pool->count < count) { + ulong pgtbl = get_zeroed_page(flags); + + if (!pgtbl) + return -ENOMEM; + + return_pgtbl_to_pool(pool, pgtbl); + } + + return 0; +} +EXPORT_SYMBOL_GPL(asi_fill_pgtbl_pool); + +void asi_clear_pgtbl_pool(struct asi_pgtbl_pool *pool) +{ + while (pool->count > 0) + free_page(get_pgtbl_from_pool(pool)); +} +EXPORT_SYMBOL_GPL(asi_clear_pgtbl_pool); + static void asi_clone_pgd(pgd_t *dst_table, pgd_t *src_table, size_t addr) { pgd_t *src = pgd_offset_pgd(src_table, addr); @@ -110,10 +159,12 @@ static void asi_clone_pgd(pgd_t *dst_table, pgd_t *src_table, size_t addr) #define DEFINE_ASI_PGTBL_ALLOC(base, level) \ static level##_t * asi_##level##_alloc(struct asi *asi, \ base##_t *base, ulong addr, \ - gfp_t flags) \ + gfp_t flags, \ + struct asi_pgtbl_pool *pool) \ { \ if (unlikely(base##_none(*base))) { \ - ulong pgtbl = get_zeroed_page(flags); \ + ulong pgtbl = pool ? 
get_pgtbl_from_pool(pool) \ + : get_zeroed_page(flags); \ phys_addr_t pgtbl_pa; \ \ if (pgtbl == 0) \ @@ -127,7 +178,10 @@ static level##_t * asi_##level##_alloc(struct asi *asi, \ mm_inc_nr_##level##s(asi->mm); \ } else { \ paravirt_release_##level(PHYS_PFN(pgtbl_pa)); \ - free_page(pgtbl); \ + if (pool) \ + return_pgtbl_to_pool(pool, pgtbl); \ + else \ + free_page(pgtbl); \ } \ \ /* NOP on native. PV call on Xen. */ \ @@ -336,6 +390,7 @@ int asi_init(struct mm_struct *mm, int asi_index, struct asi **out_asi) asi->class = &asi_class[asi_index]; asi->mm = mm; asi->pcid_index = asi_index; + rwlock_init(&asi->user_map_lock); if (asi->class->flags & ASI_MAP_STANDARD_NONSENSITIVE) { uint i; @@ -650,11 +705,6 @@ static bool follow_physaddr(struct mm_struct *mm, size_t virt, /* * Map the given range into the ASI page tables. The source of the mapping * is the regular unrestricted page tables. - * Can be used to map any kernel memory. - * - * The caller MUST ensure that the source mapping will not change during this - * function. For dynamic kernel memory, this is generally ensured by mapping - * the memory within the allocator. * * If the source mapping is a large page and the range being mapped spans the * entire large page, then it will be mapped as a large page in the ASI page @@ -664,19 +714,17 @@ static bool follow_physaddr(struct mm_struct *mm, size_t virt, * destination page, but that should be ok for now, as usually in such cases, * the range would consist of a small-ish number of pages. */ -int asi_map_gfp(struct asi *asi, void *addr, size_t len, gfp_t gfp_flags) +int __asi_map(struct asi *asi, size_t start, size_t end, gfp_t gfp_flags, + struct asi_pgtbl_pool *pool, + size_t allowed_start, size_t allowed_end) { size_t virt; - size_t start = (size_t)addr; - size_t end = start + len; size_t page_size; - if (!static_cpu_has(X86_FEATURE_ASI) || !asi) - return 0; - VM_BUG_ON(start & ~PAGE_MASK); - VM_BUG_ON(len & ~PAGE_MASK); - VM_BUG_ON(start < TASK_SIZE_MAX); + VM_BUG_ON(end & ~PAGE_MASK); + VM_BUG_ON(end > allowed_end); + VM_BUG_ON(start < allowed_start); gfp_flags &= GFP_RECLAIM_MASK; @@ -702,14 +750,15 @@ int asi_map_gfp(struct asi *asi, void *addr, size_t len, gfp_t gfp_flags) continue; \ } \ \ - level = asi_##level##_alloc(asi, base, virt, gfp_flags);\ + level = asi_##level##_alloc(asi, base, virt, \ + gfp_flags, pool); \ if (!level) \ return -ENOMEM; \ \ if (page_size >= LEVEL##_SIZE && \ (level##_none(*level) || level##_leaf(*level)) && \ is_page_within_range(virt, LEVEL##_SIZE, \ - start, end)) { \ + allowed_start, allowed_end)) {\ page_size = LEVEL##_SIZE; \ phys &= LEVEL##_MASK; \ \ @@ -737,6 +786,26 @@ int asi_map_gfp(struct asi *asi, void *addr, size_t len, gfp_t gfp_flags) return 0; } +/* + * Maps the given kernel address range into the ASI page tables. + * + * The caller MUST ensure that the source mapping will not change during this + * function. For dynamic kernel memory, this is generally ensured by mapping + * the memory within the allocator. 
+ */ +int asi_map_gfp(struct asi *asi, void *addr, size_t len, gfp_t gfp_flags) +{ + size_t start = (size_t)addr; + size_t end = start + len; + + if (!static_cpu_has(X86_FEATURE_ASI) || !asi) + return 0; + + VM_BUG_ON(start < TASK_SIZE_MAX); + + return __asi_map(asi, start, end, gfp_flags, NULL, start, end); +} + int asi_map(struct asi *asi, void *addr, size_t len) { return asi_map_gfp(asi, addr, len, GFP_KERNEL); @@ -935,3 +1004,150 @@ void asi_clear_user_p4d(struct mm_struct *mm, size_t addr) if (!pgtable_l5_enabled()) __asi_clear_user_pgd(mm, addr); } + +/* + * Maps the given userspace address range into the ASI page tables. + * + * The caller MUST ensure that the source mapping will not change during this + * function e.g. by synchronizing via MMU notifiers or acquiring the + * appropriate locks. + */ +int asi_map_user(struct asi *asi, void *addr, size_t len, + struct asi_pgtbl_pool *pool, + size_t allowed_start, size_t allowed_end) +{ + int err; + size_t start = (size_t)addr; + size_t end = start + len; + + if (!static_cpu_has(X86_FEATURE_ASI) || !asi) + return 0; + + VM_BUG_ON(end > TASK_SIZE_MAX); + + read_lock(&asi->user_map_lock); + err = __asi_map(asi, start, end, GFP_NOWAIT, pool, + allowed_start, allowed_end); + read_unlock(&asi->user_map_lock); + + return err; +} +EXPORT_SYMBOL_GPL(asi_map_user); + +static bool +asi_unmap_free_pte_range(struct asi_pgtbl_pool *pgtbls_to_free, + pte_t *pte, size_t addr, size_t end) +{ + do { + pte_clear(NULL, addr, pte); + } while (pte++, addr += PAGE_SIZE, addr != end); + + return true; +} + +#define DEFINE_ASI_UNMAP_FREE_RANGE(level, LEVEL, next_level, NEXT_LVL_SIZE) \ +static bool \ +asi_unmap_free_##level##_range(struct asi_pgtbl_pool *pgtbls_to_free, \ + level##_t *level, size_t addr, size_t end) \ +{ \ + bool unmapped = false; \ + size_t next; \ + \ + do { \ + next = level##_addr_end(addr, end); \ + if (level##_none(*level)) \ + continue; \ + \ + if (IS_ALIGNED(addr, LEVEL##_SIZE) && \ + IS_ALIGNED(next, LEVEL##_SIZE)) { \ + if (!level##_large(*level)) { \ + ulong pgtbl = level##_page_vaddr(*level); \ + struct page *page = virt_to_page(pgtbl); \ + \ + page->private = PG_LEVEL_##NEXT_LVL_SIZE; \ + return_pgtbl_to_pool(pgtbls_to_free, pgtbl); \ + } \ + level##_clear(level); \ + unmapped = true; \ + } else { \ + /* \ + * At this time, we don't have a case where we need to \ + * unmap a subset of a huge page. But that could arise \ + * in the future. In that case, we'll need to split \ + * the huge mapping here. 
\ + */ \ + if (WARN_ON(level##_large(*level))) \ + continue; \ + \ + unmapped |= asi_unmap_free_##next_level##_range( \ + pgtbls_to_free, \ + next_level##_offset(level, addr), \ + addr, next); \ + } \ + } while (level++, addr = next, addr != end); \ + \ + return unmapped; \ +} + +DEFINE_ASI_UNMAP_FREE_RANGE(pmd, PMD, pte, 4K) +DEFINE_ASI_UNMAP_FREE_RANGE(pud, PUD, pmd, 2M) +DEFINE_ASI_UNMAP_FREE_RANGE(p4d, P4D, pud, 1G) +DEFINE_ASI_UNMAP_FREE_RANGE(pgd, PGDIR, p4d, 512G) + +static bool asi_unmap_and_free_range(struct asi_pgtbl_pool *pgtbls_to_free, + struct asi *asi, size_t addr, size_t end) +{ + size_t next; + bool unmapped = false; + pgd_t *pgd = pgd_offset_pgd(asi->pgd, addr); + + BUILD_BUG_ON((void *)&((struct page *)NULL)->private == + (void *)&((struct page *)NULL)->asi_pgtbl_pool_next); + + if (pgtable_l5_enabled()) + return asi_unmap_free_pgd_range(pgtbls_to_free, pgd, addr, end); + + do { + next = pgd_addr_end(addr, end); + unmapped |= asi_unmap_free_p4d_range(pgtbls_to_free, + p4d_offset(pgd, addr), + addr, next); + } while (pgd++, addr = next, addr != end); + + return unmapped; +} + +void asi_unmap_user(struct asi *asi, void *addr, size_t len) +{ + static void (*const free_pgtbl_at_level[])(struct asi *, size_t) = { + NULL, + asi_free_pte, + asi_free_pmd, + asi_free_pud, + asi_free_p4d + }; + + struct asi_pgtbl_pool pgtbls_to_free = { 0 }; + size_t start = (size_t)addr; + size_t end = start + len; + bool unmapped; + + if (!static_cpu_has(X86_FEATURE_ASI) || !asi) + return; + + write_lock(&asi->user_map_lock); + unmapped = asi_unmap_and_free_range(&pgtbls_to_free, asi, start, end); + write_unlock(&asi->user_map_lock); + + if (unmapped) + asi_flush_tlb_range(asi, addr, len); + + while (pgtbls_to_free.count > 0) { + size_t pgtbl = get_pgtbl_from_pool(&pgtbls_to_free); + struct page *page = virt_to_page(pgtbl); + + VM_BUG_ON(page->private >= PG_LEVEL_NUM); + free_pgtbl_at_level[page->private](asi, pgtbl); + } +} +EXPORT_SYMBOL_GPL(asi_unmap_user); diff --git a/include/asm-generic/asi.h b/include/asm-generic/asi.h index 8513d0d7865a..fffb323d2a00 100644 --- a/include/asm-generic/asi.h +++ b/include/asm-generic/asi.h @@ -26,6 +26,7 @@ struct asi_hooks {}; struct asi {}; +struct asi_pgtbl_pool {}; static inline int asi_register_class(const char *name, uint flags, @@ -92,6 +93,26 @@ void asi_clear_user_pgd(struct mm_struct *mm, size_t addr) { } static inline void asi_clear_user_p4d(struct mm_struct *mm, size_t addr) { } +static inline +int asi_map_user(struct asi *asi, void *addr, size_t len, + struct asi_pgtbl_pool *pool, + size_t allowed_start, size_t allowed_end) +{ + return 0; +} + +static inline void asi_unmap_user(struct asi *asi, void *va, size_t len) { } + +static inline +int asi_fill_pgtbl_pool(struct asi_pgtbl_pool *pool, uint count, gfp_t flags) +{ + return 0; +} + +static inline void asi_clear_pgtbl_pool(struct asi_pgtbl_pool *pool) { } + +static inline void asi_init_pgtbl_pool(struct asi_pgtbl_pool *pool) { } + static inline void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len) { } diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 7d38229ca85c..c3f209720a84 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -198,7 +198,7 @@ struct page { /* Links the pages_to_free_async list */ struct llist_node async_free_node; - unsigned long _asi_pad_1; + struct page *asi_pgtbl_pool_next; u64 asi_tlb_gen; union { From patchwork Wed Feb 23 05:22:07 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 
Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756437 Date: Tue, 22 Feb 2022 21:22:07 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-32-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 31/47] mm: asi: Support for non-sensitive SLUB caches From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org This adds support for allocating global and local non-sensitive objects using the SLUB allocator.
Similar to SLAB, per-process child caches are created for locally non-sensitive allocations. This mechanism is based on a modified form of the earlier implementation of per-memcg caches. Signed-off-by: Junaid Shahid --- include/linux/slub_def.h | 6 ++ mm/slab.h | 5 ++ mm/slab_common.c | 33 +++++++-- mm/slub.c | 140 ++++++++++++++++++++++++++++++++++++++- security/Kconfig | 3 +- 5 files changed, 179 insertions(+), 8 deletions(-) diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h index 0fa751b946fa..6e185b61582c 100644 --- a/include/linux/slub_def.h +++ b/include/linux/slub_def.h @@ -137,6 +137,12 @@ struct kmem_cache { struct kasan_cache kasan_info; #endif +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + struct kmem_local_cache_info local_cache_info; + /* For propagation, maximum size of a stored attr */ + unsigned int max_attr_size; +#endif + unsigned int useroffset; /* Usercopy region offset */ unsigned int usersize; /* Usercopy region size */ diff --git a/mm/slab.h b/mm/slab.h index b9e11038be27..8799bcdd2fff 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -216,6 +216,7 @@ int __kmem_cache_shutdown(struct kmem_cache *); void __kmem_cache_release(struct kmem_cache *); int __kmem_cache_shrink(struct kmem_cache *); void slab_kmem_cache_release(struct kmem_cache *); +void kmem_cache_shrink_all(struct kmem_cache *s); struct seq_file; struct file; @@ -344,6 +345,7 @@ void restore_page_nonsensitive_metadata(struct page *page, } void set_nonsensitive_cache_params(struct kmem_cache *s); +void init_local_cache_info(struct kmem_cache *s, struct kmem_cache *root); #else /* CONFIG_ADDRESS_SPACE_ISOLATION */ @@ -380,6 +382,9 @@ static inline void restore_page_nonsensitive_metadata(struct page *page, static inline void set_nonsensitive_cache_params(struct kmem_cache *s) { } +static inline +void init_local_cache_info(struct kmem_cache *s, struct kmem_cache *root) { } + #endif /* CONFIG_ADDRESS_SPACE_ISOLATION */ #ifdef CONFIG_MEMCG_KMEM diff --git a/mm/slab_common.c b/mm/slab_common.c index b486b72d6344..efa61b97902a 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -142,7 +142,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr, LIST_HEAD(slab_root_caches); -static void init_local_cache_info(struct kmem_cache *s, struct kmem_cache *root) +void init_local_cache_info(struct kmem_cache *s, struct kmem_cache *root) { if (root) { s->local_cache_info.root_cache = root; @@ -194,9 +194,6 @@ void set_nonsensitive_cache_params(struct kmem_cache *s) #else -static inline -void init_local_cache_info(struct kmem_cache *s, struct kmem_cache *root) { } - static inline void cleanup_local_cache_info(struct kmem_cache *s) { } #endif /* CONFIG_ADDRESS_SPACE_ISOLATION */ @@ -644,6 +641,34 @@ int kmem_cache_shrink(struct kmem_cache *cachep) } EXPORT_SYMBOL(kmem_cache_shrink); +/** + * kmem_cache_shrink_all - shrink a cache and all child caches for root cache + * @s: The cache pointer + */ +void kmem_cache_shrink_all(struct kmem_cache *s) +{ + struct kmem_cache *c; + + if (!static_asi_enabled() || !is_root_cache(s)) { + kmem_cache_shrink(s); + return; + } + + kasan_cache_shrink(s); + __kmem_cache_shrink(s); + + /* + * We have to take the slab_mutex to protect from the child cache list + * modification. 
+ */ + mutex_lock(&slab_mutex); + for_each_child_cache(c, s) { + kasan_cache_shrink(c); + __kmem_cache_shrink(c); + } + mutex_unlock(&slab_mutex); +} + bool slab_is_available(void) { return slab_state >= UP; diff --git a/mm/slub.c b/mm/slub.c index abe7db581d68..df0191f8b0e2 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -289,6 +289,21 @@ static void debugfs_slab_add(struct kmem_cache *); static inline void debugfs_slab_add(struct kmem_cache *s) { } #endif +#if defined(CONFIG_SYSFS) && defined(CONFIG_ADDRESS_SPACE_ISOLATION) +static void propagate_slab_attrs_from_parent(struct kmem_cache *s); +static void propagate_slab_attr_to_children(struct kmem_cache *s, + struct attribute *attr, + const char *buf, size_t len); +#else +static inline void propagate_slab_attrs_from_parent(struct kmem_cache *s) { } + +static inline +void propagate_slab_attr_to_children(struct kmem_cache *s, + struct attribute *attr, + const char *buf, size_t len) +{ } +#endif + static inline void stat(const struct kmem_cache *s, enum stat_item si) { #ifdef CONFIG_SLUB_STATS @@ -2015,6 +2030,7 @@ static void __free_slab(struct kmem_cache *s, struct page *page) if (current->reclaim_state) current->reclaim_state->reclaimed_slab += pages; unaccount_slab_page(page, order, s); + restore_page_nonsensitive_metadata(page, s); __free_pages(page, order); } @@ -4204,6 +4220,8 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags) } } + set_nonsensitive_cache_params(s); + #if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \ defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE) if (system_has_cmpxchg_double() && (s->flags & SLAB_NO_CMPXCHG) == 0) @@ -4797,6 +4815,10 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache) #endif } list_add(&s->list, &slab_caches); + init_local_cache_info(s, NULL); +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + list_del(&static_cache->root_caches_node); +#endif return s; } @@ -4863,7 +4885,7 @@ struct kmem_cache * __kmem_cache_alias(const char *name, unsigned int size, unsigned int align, slab_flags_t flags, void (*ctor)(void *)) { - struct kmem_cache *s; + struct kmem_cache *s, *c; s = find_mergeable(size, align, flags, name, ctor); if (s) { @@ -4876,6 +4898,11 @@ __kmem_cache_alias(const char *name, unsigned int size, unsigned int align, s->object_size = max(s->object_size, size); s->inuse = max(s->inuse, ALIGN(size, sizeof(void *))); + for_each_child_cache(c, s) { + c->object_size = s->object_size; + c->inuse = max(c->inuse, ALIGN(size, sizeof(void *))); + } + if (sysfs_slab_alias(s, name)) { s->refcount--; s = NULL; @@ -4889,6 +4916,9 @@ int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags) { int err; + if (!static_asi_enabled()) + flags &= ~SLAB_NONSENSITIVE; + err = kmem_cache_open(s, flags); if (err) return err; @@ -4897,6 +4927,8 @@ int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags) if (slab_state <= UP) return 0; + propagate_slab_attrs_from_parent(s); + err = sysfs_slab_add(s); if (err) { __kmem_cache_release(s); @@ -5619,7 +5651,7 @@ static ssize_t shrink_store(struct kmem_cache *s, const char *buf, size_t length) { if (buf[0] == '1') - kmem_cache_shrink(s); + kmem_cache_shrink_all(s); else return -EINVAL; return length; @@ -5829,6 +5861,87 @@ static ssize_t slab_attr_show(struct kobject *kobj, return err; } +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + +static void propagate_slab_attrs_from_parent(struct kmem_cache *s) +{ + int i; + char *buffer = NULL; + struct kmem_cache *root_cache; + + if (is_root_cache(s)) + return; + + root_cache = 
s->local_cache_info.root_cache; + + /* + * This mean this cache had no attribute written. Therefore, no point + * in copying default values around + */ + if (!root_cache->max_attr_size) + return; + + for (i = 0; i < ARRAY_SIZE(slab_attrs); i++) { + char mbuf[64]; + char *buf; + struct slab_attribute *attr = to_slab_attr(slab_attrs[i]); + ssize_t len; + + if (!attr || !attr->store || !attr->show) + continue; + + /* + * It is really bad that we have to allocate here, so we will + * do it only as a fallback. If we actually allocate, though, + * we can just use the allocated buffer until the end. + * + * Most of the slub attributes will tend to be very small in + * size, but sysfs allows buffers up to a page, so they can + * theoretically happen. + */ + if (buffer) { + buf = buffer; + } else if (root_cache->max_attr_size < ARRAY_SIZE(mbuf) && + !IS_ENABLED(CONFIG_SLUB_STATS)) { + buf = mbuf; + } else { + buffer = (char *)get_zeroed_page(GFP_KERNEL); + if (WARN_ON(!buffer)) + continue; + buf = buffer; + } + + len = attr->show(root_cache, buf); + if (len > 0) + attr->store(s, buf, len); + } + + if (buffer) + free_page((unsigned long)buffer); +} + +static void propagate_slab_attr_to_children(struct kmem_cache *s, + struct attribute *attr, + const char *buf, size_t len) +{ + struct kmem_cache *c; + struct slab_attribute *attribute = to_slab_attr(attr); + + if (static_asi_enabled()) { + mutex_lock(&slab_mutex); + + if (s->max_attr_size < len) + s->max_attr_size = len; + + for_each_child_cache(c, s) + attribute->store(c, buf, len); + + mutex_unlock(&slab_mutex); + } +} + +#endif + static ssize_t slab_attr_store(struct kobject *kobj, struct attribute *attr, const char *buf, size_t len) @@ -5844,6 +5957,27 @@ static ssize_t slab_attr_store(struct kobject *kobj, return -EIO; err = attribute->store(s, buf, len); + + /* + * This is a best effort propagation, so this function's return + * value will be determined by the parent cache only. This is + * basically because not all attributes will have a well + * defined semantics for rollbacks - most of the actions will + * have permanent effects. + * + * Returning the error value of any of the children that fail + * is not 100 % defined, in the sense that users seeing the + * error code won't be able to know anything about the state of + * the cache. + * + * Only returning the error code for the parent cache at least + * has well defined semantics. The cache being written to + * directly either failed or succeeded, in which case we loop + * through the descendants with best-effort propagation. + */ + if (slab_state >= FULL && err >= 0 && is_root_cache(s)) + propagate_slab_attr_to_children(s, attr, buf, len); + return err; } @@ -5866,7 +6000,7 @@ static struct kset *slab_kset; static inline struct kset *cache_kset(struct kmem_cache *s) { - return slab_kset; + return is_root_cache(s) ? 
slab_kset : NULL; } #define ID_STR_LENGTH 64 diff --git a/security/Kconfig b/security/Kconfig index 070a948b5266..a5cfb09352b0 100644 --- a/security/Kconfig +++ b/security/Kconfig @@ -68,7 +68,8 @@ config PAGE_TABLE_ISOLATION config ADDRESS_SPACE_ISOLATION bool "Allow code to run with a reduced kernel address space" default n - depends on X86_64 && !UML && SLAB && !NEED_PER_CPU_KM + depends on X86_64 && !UML && !NEED_PER_CPU_KM + depends on SLAB || SLUB depends on !PARAVIRT depends on !MEMORY_HOTPLUG help From patchwork Wed Feb 23 05:22:08 2022
X-Patchwork-Submitter: Junaid Shahid
X-Patchwork-Id: 12756445
Date: Tue, 22 Feb 2022 21:22:08 -0800
In-Reply-To: <20220223052223.1202152-1-junaids@google.com>
Message-Id: <20220223052223.1202152-33-junaids@google.com>
References: <20220223052223.1202152-1-junaids@google.com>
X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 32/47] x86: asi: Allocate FPU state separately when ASI is enabled. From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org We are going to be mapping the task_struct in the restricted ASI address space. However, the task_struct also contains the FPU register state embedded inside it, which can contain sensitive information. So when ASI is enabled, always allocate the FPU state from a separate slab cache to keep it out of task_struct. Signed-off-by: Junaid Shahid --- arch/x86/include/asm/fpu/api.h | 1 + arch/x86/kernel/fpu/core.c | 45 ++++++++++++++++++++++++++++++++-- arch/x86/kernel/fpu/init.c | 7 ++++-- arch/x86/kernel/fpu/internal.h | 1 + arch/x86/kernel/fpu/xstate.c | 21 +++++++++++++--- arch/x86/kernel/process.c | 7 +++++- 6 files changed, 74 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h index c2767a6a387e..6f5ca3c2ef4a 100644 --- a/arch/x86/include/asm/fpu/api.h +++ b/arch/x86/include/asm/fpu/api.h @@ -112,6 +112,7 @@ extern void fpu__init_cpu(void); extern void fpu__init_system(struct cpuinfo_x86 *c); extern void fpu__init_check_bugs(void); extern void fpu__resume_cpu(void); +extern void fpstate_cache_init(void); #ifdef CONFIG_MATH_EMULATION extern void fpstate_init_soft(struct swregs_state *soft); diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c index 8ea306b1bf8e..d7859573973d 100644 --- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c @@ -59,6 +59,8 @@ static DEFINE_PER_CPU(bool, in_kernel_fpu); */ DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx); +struct kmem_cache *fpstate_cachep; + static bool kernel_fpu_disabled(void) { return this_cpu_read(in_kernel_fpu); @@ -443,7 +445,9 @@ static void __fpstate_reset(struct fpstate *fpstate) void fpstate_reset(struct fpu *fpu) { /* Set the fpstate pointer to the default fpstate */ - fpu->fpstate = &fpu->__fpstate; + if (!cpu_feature_enabled(X86_FEATURE_ASI)) + fpu->fpstate = &fpu->__fpstate; + __fpstate_reset(fpu->fpstate); /* Initialize the permission related info in fpu */ @@ -464,6 +468,26 @@ static inline void fpu_inherit_perms(struct fpu *dst_fpu) } } +void fpstate_cache_init(void) +{ + if (cpu_feature_enabled(X86_FEATURE_ASI)) { + size_t fpstate_size; + + /* TODO: Is the ALIGN-64 really needed? */ + fpstate_size = fpu_kernel_cfg.default_size + + ALIGN(offsetof(struct fpstate, regs), 64); + + fpstate_cachep = kmem_cache_create_usercopy( + "fpstate", + fpstate_size, + __alignof__(struct fpstate), + SLAB_PANIC | SLAB_ACCOUNT, + offsetof(struct fpstate, regs), + fpu_kernel_cfg.default_size, + NULL); + } +} + /* Clone current's FPU state on fork */ int fpu_clone(struct task_struct *dst, unsigned long clone_flags) { @@ -473,6 +497,22 @@ int fpu_clone(struct task_struct *dst, unsigned long clone_flags) /* The new task's FPU state cannot be valid in the hardware. 
*/ dst_fpu->last_cpu = -1; + if (cpu_feature_enabled(X86_FEATURE_ASI)) { + dst_fpu->fpstate = kmem_cache_alloc_node( + fpstate_cachep, GFP_KERNEL, + page_to_nid(virt_to_page(dst))); + if (!dst_fpu->fpstate) + return -ENOMEM; + + /* + * TODO: We may be able to skip the copy since the registers are + * restored below anyway. + */ + memcpy(dst_fpu->fpstate, src_fpu->fpstate, + fpu_kernel_cfg.default_size + + offsetof(struct fpstate, regs)); + } + fpstate_reset(dst_fpu); if (!cpu_feature_enabled(X86_FEATURE_FPU)) @@ -531,7 +571,8 @@ int fpu_clone(struct task_struct *dst, unsigned long clone_flags) void fpu_thread_struct_whitelist(unsigned long *offset, unsigned long *size) { *offset = offsetof(struct thread_struct, fpu.__fpstate.regs); - *size = fpu_kernel_cfg.default_size; + *size = cpu_feature_enabled(X86_FEATURE_ASI) + ? 0 : fpu_kernel_cfg.default_size; } /* diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c index 621f4b6cac4a..8b722bf98135 100644 --- a/arch/x86/kernel/fpu/init.c +++ b/arch/x86/kernel/fpu/init.c @@ -161,9 +161,11 @@ static void __init fpu__init_task_struct_size(void) /* * Add back the dynamically-calculated register state - * size. + * size, except when ASI is enabled, since in that case + * the FPU state is always allocated dynamically. */ - task_size += fpu_kernel_cfg.default_size; + if (!cpu_feature_enabled(X86_FEATURE_ASI)) + task_size += fpu_kernel_cfg.default_size; /* * We dynamically size 'struct fpu', so we require that @@ -223,6 +225,7 @@ static void __init fpu__init_init_fpstate(void) */ void __init fpu__init_system(struct cpuinfo_x86 *c) { + current->thread.fpu.fpstate = ¤t->thread.fpu.__fpstate; fpstate_reset(¤t->thread.fpu); fpu__init_system_early_generic(c); diff --git a/arch/x86/kernel/fpu/internal.h b/arch/x86/kernel/fpu/internal.h index dbdb31f55fc7..30acc7d0cb1a 100644 --- a/arch/x86/kernel/fpu/internal.h +++ b/arch/x86/kernel/fpu/internal.h @@ -3,6 +3,7 @@ #define __X86_KERNEL_FPU_INTERNAL_H extern struct fpstate init_fpstate; +extern struct kmem_cache *fpstate_cachep; /* CPU feature check wrappers */ static __always_inline __pure bool use_xsave(void) diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c index d28829403ed0..96d12f351f19 100644 --- a/arch/x86/kernel/fpu/xstate.c +++ b/arch/x86/kernel/fpu/xstate.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include @@ -1495,8 +1496,15 @@ arch_initcall(xfd_update_static_branch) void fpstate_free(struct fpu *fpu) { - if (fpu->fpstate && fpu->fpstate != &fpu->__fpstate) - vfree(fpu->fpstate); + WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_ASI) && + fpu->fpstate == &fpu->__fpstate); + + if (fpu->fpstate && fpu->fpstate != &fpu->__fpstate) { + if (fpu->fpstate->is_valloc) + vfree(fpu->fpstate); + else + kmem_cache_free(fpstate_cachep, fpu->fpstate); + } } /** @@ -1574,7 +1582,14 @@ static int fpstate_realloc(u64 xfeatures, unsigned int ksize, fpregs_unlock(); - vfree(curfps); + WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_ASI) && !curfps); + if (curfps) { + if (curfps->is_valloc) + vfree(curfps); + else + kmem_cache_free(fpstate_cachep, curfps); + } + return 0; } diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index c8d4a00a4de7..f9bd1c3415d4 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -80,6 +80,11 @@ EXPORT_PER_CPU_SYMBOL(cpu_tss_rw); DEFINE_PER_CPU(bool, __tss_limit_invalid); EXPORT_PER_CPU_SYMBOL_GPL(__tss_limit_invalid); +void __init arch_task_cache_init(void) +{ + fpstate_cache_init(); +} + /* * this 
gets called so that we can store lazy state into memory and copy the current task into the new thread. @@ -101,7 +106,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src) #ifdef CONFIG_X86_64 void arch_release_task_struct(struct task_struct *tsk) { - if (fpu_state_size_dynamic()) + if (fpu_state_size_dynamic() || cpu_feature_enabled(X86_FEATURE_ASI)) fpstate_free(&tsk->thread.fpu); } #endif From patchwork Wed Feb 23 05:22:09 2022
X-Patchwork-Submitter: Junaid Shahid
X-Patchwork-Id: 12756446
Date: Tue, 22 Feb 2022 21:22:09 -0800
In-Reply-To: <20220223052223.1202152-1-junaids@google.com>
Message-Id: <20220223052223.1202152-34-junaids@google.com>
References: <20220223052223.1202152-1-junaids@google.com>
Subject: [RFC PATCH 33/47] kvm: asi: Map
guest memory into restricted ASI address space From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org A module parameter treat_all_userspace_as_nonsensitive is added, which if set, maps the entire userspace of the process running the VM into the ASI restricted address space. If the flag is not set (the default), then just the userspace memory mapped into the VM's address space is mapped into the ASI restricted address space. Signed-off-by: Junaid Shahid --- arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/mmu.h | 6 ++++ arch/x86/kvm/mmu/mmu.c | 54 +++++++++++++++++++++++++++++++++ arch/x86/kvm/mmu/paging_tmpl.h | 14 +++++++++ arch/x86/kvm/x86.c | 19 +++++++++++- include/linux/kvm_host.h | 3 ++ virt/kvm/kvm_main.c | 7 +++++ 7 files changed, 104 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 98cbd6447e3e..e63a2f244d7b 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -681,6 +681,8 @@ struct kvm_vcpu_arch { struct kvm_mmu_memory_cache mmu_gfn_array_cache; struct kvm_mmu_memory_cache mmu_page_header_cache; + struct asi_pgtbl_pool asi_pgtbl_pool; + /* * QEMU userspace and the guest each have their own FPU state. * In vcpu_run, we switch between the user and guest FPU contexts. diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 9ae6168d381e..60b84331007d 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -49,6 +49,12 @@ #define KVM_MMU_CR0_ROLE_BITS (X86_CR0_PG | X86_CR0_WP) +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +extern bool treat_all_userspace_as_nonsensitive; +#else +#define treat_all_userspace_as_nonsensitive true +#endif + static __always_inline u64 rsvd_bits(int s, int e) { BUILD_BUG_ON(__builtin_constant_p(e) && __builtin_constant_p(s) && e < s); diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index fcdf3f8bb59a..485c0ba3ce8b 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -91,6 +91,11 @@ __MODULE_PARM_TYPE(nx_huge_pages_recovery_period_ms, "uint"); static bool __read_mostly force_flush_and_sync_on_reuse; module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644); +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +bool __ro_after_init treat_all_userspace_as_nonsensitive; +module_param(treat_all_userspace_as_nonsensitive, bool, 0444); +#endif + /* * When setting this variable to true it enables Two-Dimensional-Paging * where the hardware walks 2 page tables: @@ -2757,6 +2762,21 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot, return ret; } +static void asi_map_gfn_range(struct kvm_vcpu *vcpu, + struct kvm_memory_slot *slot, + gfn_t gfn, size_t npages) +{ + int err; + size_t hva = __gfn_to_hva_memslot(slot, gfn); + + err = asi_map_user(vcpu->kvm->asi, (void *)hva, PAGE_SIZE * npages, + &vcpu->arch.asi_pgtbl_pool, slot->userspace_addr, + slot->userspace_addr + slot->npages * PAGE_SIZE); + if (err) + kvm_err("asi_map_user for %lx-%lx failed with code %d", hva, + hva + PAGE_SIZE * npages, err); +} + static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, u64 *start, u64 *end) @@ -2776,6 +2796,9 @@ static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu, if 
(ret <= 0) return -1; + if (!treat_all_userspace_as_nonsensitive) + asi_map_gfn_range(vcpu, slot, gfn, ret); + for (i = 0; i < ret; i++, gfn++, start++) { mmu_set_spte(vcpu, slot, start, access, gfn, page_to_pfn(pages[i]), NULL); @@ -3980,6 +4003,15 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, return true; } +static void vcpu_fill_asi_pgtbl_pool(struct kvm_vcpu *vcpu) +{ + int err = asi_fill_pgtbl_pool(&vcpu->arch.asi_pgtbl_pool, + CONFIG_PGTABLE_LEVELS - 1, GFP_KERNEL); + + if (err) + kvm_err("asi_fill_pgtbl_pool failed with code %d", err); +} + /* * Returns true if the page fault is stale and needs to be retried, i.e. if the * root was invalidated by a memslot update or a relevant mmu_notifier fired. @@ -4013,6 +4045,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu); unsigned long mmu_seq; + bool try_asi_map; int r; fault->gfn = fault->addr >> PAGE_SHIFT; @@ -4038,6 +4071,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault if (handle_abnormal_pfn(vcpu, fault, ACC_ALL, &r)) return r; + try_asi_map = !treat_all_userspace_as_nonsensitive && + !is_noslot_pfn(fault->pfn); + + if (try_asi_map) + vcpu_fill_asi_pgtbl_pool(vcpu); + r = RET_PF_RETRY; if (is_tdp_mmu_fault) @@ -4052,6 +4091,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault if (r) goto out_unlock; + if (try_asi_map) + asi_map_gfn_range(vcpu, fault->slot, fault->gfn, 1); + if (is_tdp_mmu_fault) r = kvm_tdp_mmu_map(vcpu, fault); else @@ -5584,6 +5626,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu) vcpu->arch.nested_mmu.translate_gpa = translate_nested_gpa; + asi_init_pgtbl_pool(&vcpu->arch.asi_pgtbl_pool); + ret = __kvm_mmu_create(vcpu, &vcpu->arch.guest_mmu); if (ret) return ret; @@ -5713,6 +5757,15 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm, struct kvm_memory_slot *slot, struct kvm_page_track_notifier_node *node) { + /* + * Currently, we just zap the entire address range, instead of only the + * memslot. So we also just asi_unmap the entire userspace. But in the + * future, if we zap only the range belonging to the memslot, then we + * should also asi_unmap only that range. 
+ */ + if (!treat_all_userspace_as_nonsensitive) + asi_unmap_user(kvm->asi, 0, TASK_SIZE_MAX); + kvm_mmu_zap_all_fast(kvm); } @@ -6194,6 +6247,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu) free_mmu_pages(&vcpu->arch.root_mmu); free_mmu_pages(&vcpu->arch.guest_mmu); mmu_free_memory_caches(vcpu); + asi_clear_pgtbl_pool(&vcpu->arch.asi_pgtbl_pool); } void kvm_mmu_module_exit(void) diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index 708a5d297fe1..193317ad60a4 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -584,6 +584,9 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, if (is_error_pfn(pfn)) return false; + if (!treat_all_userspace_as_nonsensitive) + asi_map_gfn_range(vcpu, slot, gfn, 1); + mmu_set_spte(vcpu, slot, spte, pte_access, gfn, pfn, NULL); kvm_release_pfn_clean(pfn); return true; @@ -836,6 +839,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault int r; unsigned long mmu_seq; bool is_self_change_mapping; + bool try_asi_map; pgprintk("%s: addr %lx err %x\n", __func__, fault->addr, fault->error_code); WARN_ON_ONCE(fault->is_tdp); @@ -890,6 +894,12 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault if (handle_abnormal_pfn(vcpu, fault, walker.pte_access, &r)) return r; + try_asi_map = !treat_all_userspace_as_nonsensitive && + !is_noslot_pfn(fault->pfn); + + if (try_asi_map) + vcpu_fill_asi_pgtbl_pool(vcpu); + /* * Do not change pte_access if the pfn is a mmio page, otherwise * we will cache the incorrect access into mmio spte. @@ -919,6 +929,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault r = make_mmu_pages_available(vcpu); if (r) goto out_unlock; + + if (try_asi_map) + asi_map_gfn_range(vcpu, fault->slot, walker.gfn, 1); + r = FNAME(fetch)(vcpu, fault, &walker); kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index dd07f677d084..d0df14deae80 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8722,7 +8722,10 @@ int kvm_arch_init(void *opaque) goto out_free_percpu; if (ops->runtime_ops->flush_sensitive_cpu_state) { - r = asi_register_class("KVM", ASI_MAP_STANDARD_NONSENSITIVE, + r = asi_register_class("KVM", + ASI_MAP_STANDARD_NONSENSITIVE | + (treat_all_userspace_as_nonsensitive ? 
+ ASI_MAP_ALL_USERSPACE : 0), &kvm_asi_hooks); if (r < 0) goto out_mmu_exit; @@ -9675,6 +9678,17 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT); if (start <= apic_address && apic_address < end) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD); + + if (!treat_all_userspace_as_nonsensitive) + asi_unmap_user(kvm->asi, (void *)start, end - start); +} + +void kvm_arch_mmu_notifier_invalidate_range_start(struct kvm *kvm, + unsigned long start, + unsigned long end) +{ + if (!treat_all_userspace_as_nonsensitive) + asi_unmap_user(kvm->asi, (void *)start, end - start); } void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu) @@ -11874,6 +11888,9 @@ void kvm_arch_commit_memory_region(struct kvm *kvm, void kvm_arch_flush_shadow_all(struct kvm *kvm) { + if (!treat_all_userspace_as_nonsensitive) + asi_unmap_user(kvm->asi, 0, TASK_SIZE_MAX); + kvm_mmu_zap_all(kvm); } diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 9dd63ed21f75..f31f7442eced 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1819,6 +1819,9 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp, void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, unsigned long start, unsigned long end); +void kvm_arch_mmu_notifier_invalidate_range_start(struct kvm *kvm, + unsigned long start, + unsigned long end); #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 72c4e6b39389..e8e9c8588908 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -162,6 +162,12 @@ __weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, { } +__weak void kvm_arch_mmu_notifier_invalidate_range_start(struct kvm *kvm, + unsigned long start, + unsigned long end) +{ +} + bool kvm_is_zone_device_pfn(kvm_pfn_t pfn) { /* @@ -685,6 +691,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, spin_unlock(&kvm->mn_invalidate_lock); __kvm_handle_hva_range(kvm, &hva_range); + kvm_arch_mmu_notifier_invalidate_range_start(kvm, range->start, range->end); return 0; } From patchwork Wed Feb 23 05:22:10 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756444 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D3D68C433FE for ; Wed, 23 Feb 2022 05:27:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238339AbiBWF1r (ORCPT ); Wed, 23 Feb 2022 00:27:47 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57754 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238379AbiBWF0u (ORCPT ); Wed, 23 Feb 2022 00:26:50 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9E1CB6D392 for ; Tue, 22 Feb 2022 21:25:12 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id b18-20020a25fa12000000b0062412a8200eso20209666ybe.22 for ; Tue, 22 Feb 2022 21:25:12 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; 
Date: Tue, 22 Feb 2022 21:22:10 -0800
In-Reply-To: <20220223052223.1202152-1-junaids@google.com>
Message-Id: <20220223052223.1202152-35-junaids@google.com>
References: <20220223052223.1202152-1-junaids@google.com>
Subject: [RFC PATCH 34/47] kvm: asi: Unmap guest memory from ASI address space when using nested virt
From: Junaid Shahid
To: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, oweisse@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org
List-ID: X-Mailing-List: kvm@vger.kernel.org

L1 guest memory as a whole cannot be considered non-sensitive when an L2 is running. Even if L1 is using its own mitigations, L2 VM Exits could, in theory, bring into the cache some sensitive L1 memory without L1 getting a chance to flush it. For simplicity, we just unmap the entire L1 memory from the ASI restricted address space when nested virtualization is turned on. Though this is overridden if the treat_all_userspace_as_nonsensitive flag is enabled. In the future, we could potentially map some portions of L1 memory which are known to contain non-sensitive memory, which would reduce ASI overhead during nested virtualization. Note that unmapping the guest memory still leaves a slight hole because L2 could also potentially access copies of L1 VCPU registers stored in L0 kernel structures. In the future, this could be mitigated by having a separate ASI address space for each VCPU and treating the associated structures as locally non-sensitive only within that VCPU's ASI address space.
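(Illustration only, not part of the patch: a condensed sketch of the gating described above, using the nested_virt_enabled_count field and asi_unmap_user() that this series adds; the example_* helpers are made up, and the real code is in the diff below.)

/*
 * Turning on nested virtualization makes any guest memory that was already
 * mapped into the restricted address space suspect, so the counter is
 * raised under the MMU lock and then the whole userspace range is unmapped.
 */
static void example_nested_virt_on(struct kvm_vcpu *vcpu)
{
	write_lock(&vcpu->kvm->mmu_lock);
	vcpu->kvm->arch.nested_virt_enabled_count++;
	write_unlock(&vcpu->kvm->mmu_lock);

	asi_unmap_user(vcpu->kvm->asi, 0, TASK_SIZE_MAX);
}

/*
 * The fault path then only maps guest pages into the restricted address
 * space while no vCPU of the VM has nested virtualization enabled.
 */
static bool example_may_asi_map_guest(struct kvm *kvm)
{
	return kvm->arch.nested_virt_enabled_count == 0;
}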
Signed-off-by: Junaid Shahid --- arch/x86/include/asm/kvm_host.h | 6 ++++++ arch/x86/kvm/mmu/mmu.c | 10 ++++++++++ arch/x86/kvm/vmx/nested.c | 22 ++++++++++++++++++++++ 3 files changed, 38 insertions(+) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index e63a2f244d7b..8ba88bbcf895 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1200,6 +1200,12 @@ struct kvm_arch { */ struct list_head tdp_mmu_pages; + /* + * Number of VCPUs that have enabled nested virtualization. + * Currently only maintained when ASI is enabled. + */ + int nested_virt_enabled_count; + /* * Protects accesses to the following fields when the MMU lock * is held in read mode: diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 485c0ba3ce8b..5785a0d02558 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -94,6 +94,7 @@ module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644); #ifdef CONFIG_ADDRESS_SPACE_ISOLATION bool __ro_after_init treat_all_userspace_as_nonsensitive; module_param(treat_all_userspace_as_nonsensitive, bool, 0444); +EXPORT_SYMBOL_GPL(treat_all_userspace_as_nonsensitive); #endif /* @@ -2769,6 +2770,15 @@ static void asi_map_gfn_range(struct kvm_vcpu *vcpu, int err; size_t hva = __gfn_to_hva_memslot(slot, gfn); + /* + * For now, we just don't map any guest memory when using nested + * virtualization. In the future, we could potentially map some + * portions of guest memory which are known to contain only memory + * which would be considered non-sensitive. + */ + if (vcpu->kvm->arch.nested_virt_enabled_count) + return; + err = asi_map_user(vcpu->kvm->asi, (void *)hva, PAGE_SIZE * npages, &vcpu->arch.asi_pgtbl_pool, slot->userspace_addr, slot->userspace_addr + slot->npages * PAGE_SIZE); diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 9c941535f78c..0a0092e4102d 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -318,6 +318,14 @@ static void free_nested(struct kvm_vcpu *vcpu) nested_release_evmcs(vcpu); free_loaded_vmcs(&vmx->nested.vmcs02); + + if (cpu_feature_enabled(X86_FEATURE_ASI) && + !treat_all_userspace_as_nonsensitive) { + write_lock(&vcpu->kvm->mmu_lock); + WARN_ON(vcpu->kvm->arch.nested_virt_enabled_count <= 0); + vcpu->kvm->arch.nested_virt_enabled_count--; + write_unlock(&vcpu->kvm->mmu_lock); + } } /* @@ -4876,6 +4884,20 @@ static int enter_vmx_operation(struct kvm_vcpu *vcpu) pt_update_intercept_for_msr(vcpu); } + if (cpu_feature_enabled(X86_FEATURE_ASI) && + !treat_all_userspace_as_nonsensitive) { + /* + * We do the increment under the MMU lock in order to prevent + * it from happening concurrently with asi_map_gfn_range(). 
+ */ + write_lock(&vcpu->kvm->mmu_lock); + WARN_ON(vcpu->kvm->arch.nested_virt_enabled_count < 0); + vcpu->kvm->arch.nested_virt_enabled_count++; + write_unlock(&vcpu->kvm->mmu_lock); + + asi_unmap_user(vcpu->kvm->asi, 0, TASK_SIZE_MAX); + } + return 0; out_shadow_vmcs: From patchwork Wed Feb 23 05:22:11 2022
X-Patchwork-Submitter: Junaid Shahid
X-Patchwork-Id: 12756443
Date: Tue, 22 Feb 2022 21:22:11 -0800
In-Reply-To: <20220223052223.1202152-1-junaids@google.com>
Message-Id: <20220223052223.1202152-36-junaids@google.com>
References: <20220223052223.1202152-1-junaids@google.com>
Subject: [RFC PATCH 35/47] mm: asi: asi_exit() on PF, skip handling if address is accessible
From: Junaid Shahid
To: linux-kernel@vger.kernel.org
Cc: Ofir Weisse , kvm@vger.kernel.org,
pbonzini@redhat.com, jmattson@google.com, pjt@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Ofir Weisse On a page-fault - do asi_exit(). Then check whether the address is accessible now, after the exit. We do this by refactoring spurious_kernel_fault() into two parts: 1. Verify that the error code value is something that could arise from a lazy TLB update. 2. Walk the page table and verify permissions, which is now called is_address_accessible_now(). We also define PTE_PRESENT() and PMD_PRESENT() which are suitable for checking userspace pages. For the sake of spurious faults, pte_present() and pmd_present() are only good for kernelspace pages. This is because these macros might return true even if the present bit is 0 (only relevant for userspace). Signed-off-by: Ofir Weisse --- arch/x86/mm/fault.c | 60 ++++++++++++++++++++++++++++++++++------ include/linux/mm_types.h | 3 ++ 2 files changed, 55 insertions(+), 8 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 8692eb50f4a5..d08021ba380b 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -982,6 +982,8 @@ static int spurious_kernel_fault_check(unsigned long error_code, pte_t *pte) return 1; } +static int is_address_accessible_now(unsigned long error_code, unsigned long address, + pgd_t *pgd); /* * Handle a spurious fault caused by a stale TLB entry. * @@ -1003,15 +1005,13 @@ static int spurious_kernel_fault_check(unsigned long error_code, pte_t *pte) * See Intel Developer's Manual Vol 3 Section 4.10.4.3, bullet 3 * (Optional Invalidation). */ +/* A spurious fault is also possible when Address Space Isolation (ASI) is in + * use. Specifically, code running within an ASI domain touched memory outside + * the domain. This access causes a page-fault --> asi_exit() */ static noinline int spurious_kernel_fault(unsigned long error_code, unsigned long address) { pgd_t *pgd; - p4d_t *p4d; - pud_t *pud; - pmd_t *pmd; - pte_t *pte; - int ret; /* * Only writes to RO or instruction fetches from NX may cause @@ -1027,6 +1027,37 @@ spurious_kernel_fault(unsigned long error_code, unsigned long address) return 0; pgd = init_mm.pgd + pgd_index(address); + return is_address_accessible_now(error_code, address, pgd); +} +NOKPROBE_SYMBOL(spurious_kernel_fault); + + +/* Check if an address (kernel or userspace) would cause a page fault if + * accessed now. + * + * For kernel addresses, pte_present and pmd_present are sufficient. For + * userspace, we must use PTE_PRESENT and PMD_PRESENT, which will only check the + * present bits. + * The existing pmd_present() in arch/x86/include/asm/pgtable.h is misleading. + * The PMD page might be in the middle of split_huge_page with present bit + * clear, but pmd_present will still return true. We are interested in knowing + * if the page is accessible to hardware - that is - the present bit is 1. */ +#define PMD_PRESENT(pmd) (pmd_flags(pmd) & _PAGE_PRESENT) + +/* pte_present will return true if _PAGE_PROTNONE is 1. We care if the hardware + * can actually access the page right now.
*/ +#define PTE_PRESENT(pte) (pte_flags(pte) & _PAGE_PRESENT) + +static noinline int +is_address_accessible_now(unsigned long error_code, unsigned long address, + pgd_t *pgd) +{ + p4d_t *p4d; + pud_t *pud; + pmd_t *pmd; + pte_t *pte; + int ret; + if (!pgd_present(*pgd)) return 0; @@ -1045,14 +1076,14 @@ spurious_kernel_fault(unsigned long error_code, unsigned long address) return spurious_kernel_fault_check(error_code, (pte_t *) pud); pmd = pmd_offset(pud, address); - if (!pmd_present(*pmd)) + if (!PMD_PRESENT(*pmd)) return 0; if (pmd_large(*pmd)) return spurious_kernel_fault_check(error_code, (pte_t *) pmd); pte = pte_offset_kernel(pmd, address); - if (!pte_present(*pte)) + if (!PTE_PRESENT(*pte)) return 0; ret = spurious_kernel_fault_check(error_code, pte); @@ -1068,7 +1099,6 @@ spurious_kernel_fault(unsigned long error_code, unsigned long address) return ret; } -NOKPROBE_SYMBOL(spurious_kernel_fault); int show_unhandled_signals = 1; @@ -1504,6 +1534,20 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault) * the fixup on the next page fault. */ struct asi *asi = asi_get_current(); + if (asi) + asi_exit(); + + /* handle_page_fault() might call BUG() if we run it for a kernel + * address. This might be the case if we got here due to an ASI fault. + * We avoid this case by checking whether the address is now, after a + * potential asi_exit(), accessible by hardware. If it is - there's + * nothing to do. + */ + if (current && mm_asi_enabled(current->mm)) { + pgd_t *pgd = (pgd_t*)__va(read_cr3_pa()) + pgd_index(address); + if (is_address_accessible_now(error_code, address, pgd)) + return; + } prefetchw(¤t->mm->mmap_lock); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index c3f209720a84..560909e80841 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -707,6 +707,9 @@ extern struct mm_struct init_mm; #ifdef CONFIG_ADDRESS_SPACE_ISOLATION static inline bool mm_asi_enabled(struct mm_struct *mm) { + if (!mm) + return false; + return mm->asi_enabled; } #else From patchwork Wed Feb 23 05:22:12 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756438 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 06604C433EF for ; Wed, 23 Feb 2022 05:27:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237619AbiBWF1Z (ORCPT ); Wed, 23 Feb 2022 00:27:25 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57936 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238399AbiBWF0w (ORCPT ); Wed, 23 Feb 2022 00:26:52 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AFD596A022 for ; Tue, 22 Feb 2022 21:25:16 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-2d6bca75aa2so141255987b3.18 for ; Tue, 22 Feb 2022 21:25:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=Bg8ARgAvpxqrVZujUdMBpF3O5GREYgeeic1zpLXlEAs=; b=ZkYcibzizevz7miAILMO787yJLGRxl/cgzlUmAaY/wAepMJq6EHrvI2zYY3SBnMhsX y7RWUVRNbxiaiAE1t5uCyYG9PImJ+/9dR0KGIg93jTvqY0XFQDy403oNMPINt668O0wX 
Date: Tue, 22 Feb 2022 21:22:12 -0800
In-Reply-To: <20220223052223.1202152-1-junaids@google.com>
Message-Id: <20220223052223.1202152-37-junaids@google.com>
References: <20220223052223.1202152-1-junaids@google.com>
Subject: [RFC PATCH 36/47] mm: asi: Adding support for dynamic percpu ASI allocations
From: Junaid Shahid
To: linux-kernel@vger.kernel.org
Cc: Ofir Weisse , kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org
List-ID: X-Mailing-List: kvm@vger.kernel.org

From: Ofir Weisse Adding infrastructure to support pcpu_alloc with gfp flag of __GFP_GLOBAL_NONSENSITIVE. We use a similar mechanism as the earlier infrastructure for memcg percpu allocations and add pcpu type PCPU_CHUNK_ASI_NONSENSITIVE. pcpu_chunk_list(PCPU_CHUNK_ASI_NONSENSITIVE) will return a list of ASI nonsensitive percpu chunks, allowing most of the code to be unchanged. Signed-off-by: Ofir Weisse --- mm/percpu-internal.h | 23 ++++++- mm/percpu-km.c | 5 +- mm/percpu-vm.c | 6 +- mm/percpu.c | 139 ++++++++++++++++++++++++++++++++++--------- 4 files changed, 141 insertions(+), 32 deletions(-) diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index 639662c20c82..2fac01114edc 100644 --- a/mm/percpu-internal.h +++ b/mm/percpu-internal.h @@ -5,6 +5,15 @@ #include #include +enum pcpu_chunk_type { + PCPU_CHUNK_ROOT, +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + PCPU_CHUNK_ASI_NONSENSITIVE, +#endif + PCPU_NR_CHUNK_TYPES, + PCPU_FAIL_ALLOC = PCPU_NR_CHUNK_TYPES +}; + /* * pcpu_block_md is the metadata block struct. * Each chunk's bitmap is split into a number of full blocks.
@@ -59,6 +68,9 @@ struct pcpu_chunk { #ifdef CONFIG_MEMCG_KMEM struct obj_cgroup **obj_cgroups; /* vector of object cgroups */ #endif +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + bool is_asi_nonsensitive; /* ASI nonsensitive chunk */ +#endif int nr_pages; /* # of pages served by this chunk */ int nr_populated; /* # of populated pages */ @@ -68,7 +80,7 @@ struct pcpu_chunk { extern spinlock_t pcpu_lock; -extern struct list_head *pcpu_chunk_lists; +extern struct list_head *pcpu_chunk_lists[PCPU_NR_CHUNK_TYPES]; extern int pcpu_nr_slots; extern int pcpu_sidelined_slot; extern int pcpu_to_depopulate_slot; @@ -113,6 +125,15 @@ static inline int pcpu_chunk_map_bits(struct pcpu_chunk *chunk) return pcpu_nr_pages_to_map_bits(chunk->nr_pages); } +static inline enum pcpu_chunk_type pcpu_chunk_type(struct pcpu_chunk *chunk) +{ +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + if (chunk->is_asi_nonsensitive) + return PCPU_CHUNK_ASI_NONSENSITIVE; +#endif + return PCPU_CHUNK_ROOT; +} + #ifdef CONFIG_PERCPU_STATS #include diff --git a/mm/percpu-km.c b/mm/percpu-km.c index fe31aa19db81..01e31bd55860 100644 --- a/mm/percpu-km.c +++ b/mm/percpu-km.c @@ -50,7 +50,8 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk, /* nada */ } -static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp) +static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type, + gfp_t gfp) { const int nr_pages = pcpu_group_sizes[0] >> PAGE_SHIFT; struct pcpu_chunk *chunk; @@ -58,7 +59,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp) unsigned long flags; int i; - chunk = pcpu_alloc_chunk(gfp); + chunk = pcpu_alloc_chunk(type, gfp); if (!chunk) return NULL; diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c index 5579a96ad782..59f3b55abdd1 100644 --- a/mm/percpu-vm.c +++ b/mm/percpu-vm.c @@ -357,7 +357,8 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk, pcpu_free_pages(chunk, pages, page_start, page_end); } -static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp) +static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type, + gfp_t gfp) { struct pcpu_chunk *chunk; struct vm_struct **vms; @@ -368,7 +369,8 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp) gfp &= ~__GFP_GLOBAL_NONSENSITIVE; - chunk = pcpu_alloc_chunk(gfp); + chunk = pcpu_alloc_chunk(type, gfp); + if (!chunk) return NULL; diff --git a/mm/percpu.c b/mm/percpu.c index f5b2c2ea5a54..beaca5adf9d4 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -172,7 +172,7 @@ struct pcpu_chunk *pcpu_reserved_chunk __ro_after_init; DEFINE_SPINLOCK(pcpu_lock); /* all internal data structures */ static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop, map ext */ -struct list_head *pcpu_chunk_lists __ro_after_init; /* chunk list slots */ +struct list_head *pcpu_chunk_lists[PCPU_NR_CHUNK_TYPES] __ro_after_init; /* chunk list slots */ /* chunks which need their map areas extended, protected by pcpu_lock */ static LIST_HEAD(pcpu_map_extend_chunks); @@ -531,10 +531,12 @@ static void __pcpu_chunk_move(struct pcpu_chunk *chunk, int slot, bool move_front) { if (chunk != pcpu_reserved_chunk) { + struct list_head *pcpu_type_lists = + pcpu_chunk_lists[pcpu_chunk_type(chunk)]; if (move_front) - list_move(&chunk->list, &pcpu_chunk_lists[slot]); + list_move(&chunk->list, &pcpu_type_lists[slot]); else - list_move_tail(&chunk->list, &pcpu_chunk_lists[slot]); + list_move_tail(&chunk->list, &pcpu_type_lists[slot]); } } @@ -570,13 +572,16 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot) static void pcpu_isolate_chunk(struct pcpu_chunk *chunk) 
{ + struct list_head *pcpu_type_lists = + pcpu_chunk_lists[pcpu_chunk_type(chunk)]; + lockdep_assert_held(&pcpu_lock); if (!chunk->isolated) { chunk->isolated = true; pcpu_nr_empty_pop_pages -= chunk->nr_empty_pop_pages; } - list_move(&chunk->list, &pcpu_chunk_lists[pcpu_to_depopulate_slot]); + list_move(&chunk->list, &pcpu_type_lists[pcpu_to_depopulate_slot]); } static void pcpu_reintegrate_chunk(struct pcpu_chunk *chunk) @@ -1438,7 +1443,8 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr, return chunk; } -static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp) +static struct pcpu_chunk *pcpu_alloc_chunk(enum pcpu_chunk_type type, + gfp_t gfp) { struct pcpu_chunk *chunk; int region_bits; @@ -1475,6 +1481,13 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp) goto objcg_fail; } #endif +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + /* TODO: (oweisse) do asi_map for nonsensitive chunks */ + if (type == PCPU_CHUNK_ASI_NONSENSITIVE) + chunk->is_asi_nonsensitive = true; + else + chunk->is_asi_nonsensitive = false; +#endif pcpu_init_md_blocks(chunk); @@ -1580,7 +1593,8 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk, int page_start, int page_end); static void pcpu_post_unmap_tlb_flush(struct pcpu_chunk *chunk, int page_start, int page_end); -static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp); +static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type, + gfp_t gfp); static void pcpu_destroy_chunk(struct pcpu_chunk *chunk); static struct page *pcpu_addr_to_page(void *addr); static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai); @@ -1733,6 +1747,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, unsigned long flags; void __percpu *ptr; size_t bits, bit_align; + enum pcpu_chunk_type type; + struct list_head *pcpu_type_lists; gfp = current_gfp_context(gfp); /* whitelisted flags that can be passed to the backing allocators */ @@ -1763,6 +1779,16 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, if (unlikely(!pcpu_memcg_pre_alloc_hook(size, gfp, &objcg))) return NULL; + type = PCPU_CHUNK_ROOT; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + if (static_asi_enabled() && (gfp & __GFP_GLOBAL_NONSENSITIVE)) { + type = PCPU_CHUNK_ASI_NONSENSITIVE; + pcpu_gfp |= __GFP_GLOBAL_NONSENSITIVE; + } +#endif + pcpu_type_lists = pcpu_chunk_lists[type]; + BUG_ON(!pcpu_type_lists); + if (!is_atomic) { /* * pcpu_balance_workfn() allocates memory under this mutex, @@ -1800,7 +1826,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, restart: /* search through normal chunks */ for (slot = pcpu_size_to_slot(size); slot <= pcpu_free_slot; slot++) { - list_for_each_entry_safe(chunk, next, &pcpu_chunk_lists[slot], + list_for_each_entry_safe(chunk, next, &pcpu_type_lists[slot], list) { off = pcpu_find_block_fit(chunk, bits, bit_align, is_atomic); @@ -1830,8 +1856,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, goto fail; } - if (list_empty(&pcpu_chunk_lists[pcpu_free_slot])) { - chunk = pcpu_create_chunk(pcpu_gfp); + if (list_empty(&pcpu_type_lists[pcpu_free_slot])) { + chunk = pcpu_create_chunk(type, pcpu_gfp); if (!chunk) { err = "failed to allocate new chunk"; goto fail; @@ -1983,12 +2009,19 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align) * CONTEXT: * pcpu_lock (can be dropped temporarily) */ -static void pcpu_balance_free(bool empty_only) + +static void __pcpu_balance_free(bool empty_only, + enum pcpu_chunk_type 
type) { LIST_HEAD(to_free); - struct list_head *free_head = &pcpu_chunk_lists[pcpu_free_slot]; + struct list_head *pcpu_type_lists = pcpu_chunk_lists[type]; + struct list_head *free_head; struct pcpu_chunk *chunk, *next; + if (!pcpu_type_lists) + return; + free_head = &pcpu_type_lists[pcpu_free_slot]; + lockdep_assert_held(&pcpu_lock); /* @@ -2026,6 +2059,14 @@ static void pcpu_balance_free(bool empty_only) spin_lock_irq(&pcpu_lock); } +static void pcpu_balance_free(bool empty_only) +{ + enum pcpu_chunk_type type; + for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) { + __pcpu_balance_free(empty_only, type); + } +} + /** * pcpu_balance_populated - manage the amount of populated pages * @@ -2038,12 +2079,21 @@ static void pcpu_balance_free(bool empty_only) * CONTEXT: * pcpu_lock (can be dropped temporarily) */ -static void pcpu_balance_populated(void) +static void __pcpu_balance_populated(enum pcpu_chunk_type type) { /* gfp flags passed to underlying allocators */ - const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN; + const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + | (type == PCPU_CHUNK_ASI_NONSENSITIVE ? + __GFP_GLOBAL_NONSENSITIVE : 0) +#endif + ; struct pcpu_chunk *chunk; int slot, nr_to_pop, ret; + struct list_head *pcpu_type_lists = pcpu_chunk_lists[type]; + + if (!pcpu_type_lists) + return; lockdep_assert_held(&pcpu_lock); @@ -2074,7 +2124,7 @@ static void pcpu_balance_populated(void) if (!nr_to_pop) break; - list_for_each_entry(chunk, &pcpu_chunk_lists[slot], list) { + list_for_each_entry(chunk, &pcpu_type_lists[slot], list) { nr_unpop = chunk->nr_pages - chunk->nr_populated; if (nr_unpop) break; @@ -2107,7 +2157,7 @@ static void pcpu_balance_populated(void) if (nr_to_pop) { /* ran out of chunks to populate, create a new one and retry */ spin_unlock_irq(&pcpu_lock); - chunk = pcpu_create_chunk(gfp); + chunk = pcpu_create_chunk(type, gfp); cond_resched(); spin_lock_irq(&pcpu_lock); if (chunk) { @@ -2117,6 +2167,14 @@ static void pcpu_balance_populated(void) } } +static void pcpu_balance_populated() +{ + enum pcpu_chunk_type type; + + for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) + __pcpu_balance_populated(type); +} + /** * pcpu_reclaim_populated - scan over to_depopulate chunks and free empty pages * @@ -2132,13 +2190,19 @@ static void pcpu_balance_populated(void) * pcpu_lock (can be dropped temporarily) * */ -static void pcpu_reclaim_populated(void) + + +static void __pcpu_reclaim_populated(enum pcpu_chunk_type type) { struct pcpu_chunk *chunk; struct pcpu_block_md *block; int freed_page_start, freed_page_end; int i, end; bool reintegrate; + struct list_head *pcpu_type_lists = pcpu_chunk_lists[type]; + + if (!pcpu_type_lists) + return; lockdep_assert_held(&pcpu_lock); @@ -2148,8 +2212,8 @@ static void pcpu_reclaim_populated(void) * other accessor is the free path which only returns area back to the * allocator not touching the populated bitmap. 
*/ - while (!list_empty(&pcpu_chunk_lists[pcpu_to_depopulate_slot])) { - chunk = list_first_entry(&pcpu_chunk_lists[pcpu_to_depopulate_slot], + while (!list_empty(&pcpu_type_lists[pcpu_to_depopulate_slot])) { + chunk = list_first_entry(&pcpu_type_lists[pcpu_to_depopulate_slot], struct pcpu_chunk, list); WARN_ON(chunk->immutable); @@ -2219,10 +2283,18 @@ static void pcpu_reclaim_populated(void) pcpu_reintegrate_chunk(chunk); else list_move_tail(&chunk->list, - &pcpu_chunk_lists[pcpu_sidelined_slot]); + &pcpu_type_lists[pcpu_sidelined_slot]); } } +static void pcpu_reclaim_populated(void) +{ + enum pcpu_chunk_type type; + for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) { + __pcpu_reclaim_populated(type); + } +} + /** * pcpu_balance_workfn - manage the amount of free chunks and populated pages * @work: unused @@ -2268,6 +2340,7 @@ void free_percpu(void __percpu *ptr) unsigned long flags; int size, off; bool need_balance = false; + struct list_head *pcpu_type_lists = NULL; if (!ptr) return; @@ -2280,6 +2353,8 @@ void free_percpu(void __percpu *ptr) chunk = pcpu_chunk_addr_search(addr); off = addr - chunk->base_addr; + pcpu_type_lists = pcpu_chunk_lists[pcpu_chunk_type(chunk)]; + BUG_ON(!pcpu_type_lists); size = pcpu_free_area(chunk, off); @@ -2293,7 +2368,7 @@ void free_percpu(void __percpu *ptr) if (!chunk->isolated && chunk->free_bytes == pcpu_unit_size) { struct pcpu_chunk *pos; - list_for_each_entry(pos, &pcpu_chunk_lists[pcpu_free_slot], list) + list_for_each_entry(pos, &pcpu_type_lists[pcpu_free_slot], list) if (pos != chunk) { need_balance = true; break; @@ -2601,6 +2676,7 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, int map_size; unsigned long tmp_addr; size_t alloc_size; + enum pcpu_chunk_type type; #define PCPU_SETUP_BUG_ON(cond) do { \ if (unlikely(cond)) { \ @@ -2723,15 +2799,24 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, pcpu_free_slot = pcpu_sidelined_slot + 1; pcpu_to_depopulate_slot = pcpu_free_slot + 1; pcpu_nr_slots = pcpu_to_depopulate_slot + 1; - pcpu_chunk_lists = memblock_alloc(pcpu_nr_slots * - sizeof(pcpu_chunk_lists[0]), + for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) { +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + if (type == PCPU_CHUNK_ASI_NONSENSITIVE && + !static_asi_enabled()) { + pcpu_chunk_lists[type] = NULL; + continue; + } +#endif + pcpu_chunk_lists[type] = memblock_alloc(pcpu_nr_slots * + sizeof(pcpu_chunk_lists[0][0]), SMP_CACHE_BYTES); - if (!pcpu_chunk_lists) - panic("%s: Failed to allocate %zu bytes\n", __func__, - pcpu_nr_slots * sizeof(pcpu_chunk_lists[0])); + if (!pcpu_chunk_lists[type]) + panic("%s: Failed to allocate %zu bytes\n", __func__, + pcpu_nr_slots * sizeof(pcpu_chunk_lists[0][0])); - for (i = 0; i < pcpu_nr_slots; i++) - INIT_LIST_HEAD(&pcpu_chunk_lists[i]); + for (i = 0; i < pcpu_nr_slots; i++) + INIT_LIST_HEAD(&pcpu_chunk_lists[type][i]); + } /* * The end of the static region needs to be aligned with the From patchwork Wed Feb 23 05:22:13 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756442 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 23990C4332F for ; Wed, 23 Feb 2022 05:27:14 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238378AbiBWF1i (ORCPT ); Wed, 23 
Feb 2022 00:27:38 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57960 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238403AbiBWF0w (ORCPT ); Wed, 23 Feb 2022 00:26:52 -0500 Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CC56E6A06A for ; Tue, 22 Feb 2022 21:25:17 -0800 (PST) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-2d726bd83a2so91090537b3.20 for ; Tue, 22 Feb 2022 21:25:17 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=MBcUE/GjgKAOqkE7k3I16nNGFc2xdrG3rcZqdKaqU1M=; b=huyqkiXgbsdUUaeO2YcSpDguTaIsD2ZiHBttX5Znp2Wu7SotB1quckmXNNZhq3hhVZ CvfghBA5TRiyV7nNwREBXze4Ut4RZl9/dXE69gre6OrHccOhGsM1R39ve/3Fg95dFOoY BvDeP6788/BG9zG4uemrkiHvTMwkxVsftUJqZuGB4ivr+/aWtu4qk7dIqmSXEwo1GhxB UQrfqBGsjla2ZXhbWoUWiWOKjoE4zvV4OfIuCxJHeuNkl1zKwODoNYWYiaOMSwj9wOKz BS0K9yywZ4zV80Is1LL2lLof7xLiZKUcwJTWCzQDyX5SOghoozr2KNat8Mt1uT82lph4 NHIQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=MBcUE/GjgKAOqkE7k3I16nNGFc2xdrG3rcZqdKaqU1M=; b=CCNClOY5ivJyj901263dC2omVSotV5hqV6gdyjNOLUaTKUkM2OUCEP5AMBc3FYb79z HaXDr5oKaCX1gHmcNPJ3Ng6eX45mbZaRqszuQxIdK2GhLHHMOXumkaPY3cDP/Ea56luU gVpcAmFgL3Vd63EEpFw3BP+jyBuZEt+X8WHXehbsx5opjY5WCFOuYnxetpV+Cy/d1pF2 yse3DQ5sM+HMG10PRBnixnYdbe5/mhpE4HC7KL3oVIg/C+fBMPCBKAJ1xjoq9S+TTwEn RC2CX62JM9xhdF8HMzVhu3VXpYL/2OAkhm2rq4IxzisNTXNPxPWIrBERVdC/NYwA4Z4G VAAQ== X-Gm-Message-State: AOAM533PzHrMuc5dJTivW/WgBcOPcrk/jbMEeIgtS/aDRBuznX4X0C+o YwCM5VhL+uJ0PGAVYPINiVEYP8inTCCN X-Google-Smtp-Source: ABdhPJzeh7uokPy0ce1B9Y+L+/uH+xpWq2sOY5OlnwbNrO2jfXStVDw+VQrszqUGfmnR9pf7y3TYhwx7yyIZ X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a81:7d56:0:b0:2d6:90d9:770c with SMTP id y83-20020a817d56000000b002d690d9770cmr26589608ywc.277.1645593908116; Tue, 22 Feb 2022 21:25:08 -0800 (PST) Date: Tue, 22 Feb 2022 21:22:13 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-38-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 37/47] mm: asi: ASI annotation support for static variables. From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: Ofir Weisse , kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Ofir Weisse Added the following annotations: __asi_not_sensitive: for static variables which are considered not sensitive. __asi_not_sensitive_readmostly: similar to __read_mostly, for non-sensitive static variables. 
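For illustration only (this snippet is not part of the patch and the variable names are invented), a static variable known to hold no sensitive data would be annotated as follows; with CONFIG_ADDRESS_SPACE_ISOLATION disabled, both annotations expand to nothing:

    #include <asm/asi.h>

    /* Placed in .data..asi_non_sensitive, which stays mapped in the restricted
     * ASI address space, so touching it does not force an ASI exit. */
    static unsigned long example_exit_counter __asi_not_sensitive;

    /* Read-mostly variant, placed in .data..asi_non_sensitive_readmostly. */
    static bool example_feature_enabled __asi_not_sensitive_readmostly = true;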
Signed-off-by: Ofir Weisse --- arch/x86/include/asm/asi.h | 12 ++++++++++++ include/asm-generic/asi.h | 6 ++++++ include/asm-generic/vmlinux.lds.h | 18 +++++++++++++++++- 3 files changed, 35 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index bdb2f70d4f85..6dd9c7c8a2b8 100644 --- a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -177,6 +177,18 @@ static inline pgd_t *asi_pgd(struct asi *asi) return asi->pgd; } +/* IMPORTANT: Any modification to the name here should also be applied to + * include/asm-generic/vmlinux.lds.h */ +#define ASI_NON_SENSITIVE_SECTION_NAME ".data..asi_non_sensitive" +#define ASI_NON_SENSITIVE_READ_MOSTLY_SECTION_NAME \ + ".data..asi_non_sensitive_readmostly" + +#define __asi_not_sensitive \ + __section(ASI_NON_SENSITIVE_SECTION_NAME) + +#define __asi_not_sensitive_readmostly \ + __section(ASI_NON_SENSITIVE_READ_MOSTLY_SECTION_NAME) + #else /* CONFIG_ADDRESS_SPACE_ISOLATION */ static inline void asi_intr_enter(void) { } diff --git a/include/asm-generic/asi.h b/include/asm-generic/asi.h index fffb323d2a00..d9082267a5dd 100644 --- a/include/asm-generic/asi.h +++ b/include/asm-generic/asi.h @@ -121,6 +121,12 @@ void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len) { } #define static_asi_enabled() false +/* IMPORTANT: Any modification to the name here should also be applied to + * include/asm-generic/vmlinux.lds.h */ + +#define __asi_not_sensitive +#define __asi_not_sensitive_readmostly + #endif /* !_ASSEMBLY_ */ #endif /* !CONFIG_ADDRESS_SPACE_ISOLATION */ diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h index 42f3866bca69..c769d939c15f 100644 --- a/include/asm-generic/vmlinux.lds.h +++ b/include/asm-generic/vmlinux.lds.h @@ -374,10 +374,26 @@ . = ALIGN(PAGE_SIZE); \ __nosave_end = .; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +#define ASI_NOT_SENSITIVE_DATA(page_align) \ + . = ALIGN(page_align); \ + __start_asi_nonsensitive = .; \ + *(.data..asi_non_sensitive) \ + . = ALIGN(page_align); \ + __end_asi_nonsensitive = .; \ + __start_asi_nonsensitive_readmostly = .; \ + *(.data..asi_non_sensitive_readmostly) \ + . = ALIGN(page_align); \ + __end_asi_nonsensitive_readmostly = .; +#else +#define ASI_NOT_SENSITIVE_DATA +#endif + #define PAGE_ALIGNED_DATA(page_align) \ . = ALIGN(page_align); \ *(.data..page_aligned) \ - . = ALIGN(page_align); + . = ALIGN(page_align); \ + ASI_NOT_SENSITIVE_DATA(page_align) #define READ_MOSTLY_DATA(align) \ . 
= ALIGN(align); \ From patchwork Wed Feb 23 05:22:14 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756441 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0CEDCC433F5 for ; Wed, 23 Feb 2022 05:27:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238369AbiBWF1g (ORCPT ); Wed, 23 Feb 2022 00:27:36 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57816 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238417AbiBWF0x (ORCPT ); Wed, 23 Feb 2022 00:26:53 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 648E66B092 for ; Tue, 22 Feb 2022 21:25:20 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id z15-20020a25bb0f000000b00613388c7d99so26694300ybg.8 for ; Tue, 22 Feb 2022 21:25:20 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=KXrCYGgvo83JEgYfSU2xPmsuGvCpoOVuzDhVbsTzD0I=; b=peKSlY/Xr61DvdFpVmXc37FMKYnl+vkSCRtwe1UF1UGbHZ0v3cMEdZbiKKvXTZ5GbO laOcMzhj8sBkgNnt1mc4a0NZR0hCQ3gPAdN/2qIaNw43JOyvktEB0Y0rn3QcamgO9gZ8 7DbNXkvTjq6TGSgTmAj628OIdjZHlZYdFsABsewAF21y3xPSWYsHQAWzziaNW39gFRL5 yWxZv0skPs5ax5wbHVY8sW6jeIPlZLUx/HQpjPftWbdPu83nt4wFjKis+ce33NEw7HrY 03O9DgYBb/75g6K+v8UjyCDA3xzr7E/F2srmJjBiOtEMdOtKjPJnG/mlJ872cZ/zgYG/ 2wwQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=KXrCYGgvo83JEgYfSU2xPmsuGvCpoOVuzDhVbsTzD0I=; b=7lKvnq9hoNdNTQrUft9jZcMPAyZLXr8Vunu98Aw9GRiyPt4e3QPEdV26eZYkMGZB3I NchGQ5AjQBG6KD8etXoM26Ql4LX5OvZalCq2AZjvYpp49pw1JWltp4uLayaEy/6BS3NH tFhxOpOZ3AouKe90YugIhRKAsMf4AANCBnPp3DbonBDpQMOJ5lQDMKhWffOG6NHRPH93 mBOULNQMoC6tRz0XI5EDAccw3MYUe4644MiYLRGm4dW09lSZM2m9COlDuW5eemqfMUcm 6XMXT8hohoIV2pPw74nL9J0eH7lDwbBd/OSQOeVbZQDpHB0mUCPT2CfyNBsAl5RZaPmU +DNw== X-Gm-Message-State: AOAM532aAHElHExnnWB2aooow9N5loRssGITB3tR5afbhDfyKRgy1cwP SI+4xSHDsvAaio6YukvL5+ggCo02BC1K X-Google-Smtp-Source: ABdhPJyZfLfl11T/5CrHmveifdlj9QKZBHoh8PI3AMF6sP4DdsodfKmQ8jHyUSfS7xsGTotzIsjTTydYzCuX X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a81:c607:0:b0:2cb:a34a:355c with SMTP id l7-20020a81c607000000b002cba34a355cmr27125747ywi.487.1645593910227; Tue, 22 Feb 2022 21:25:10 -0800 (PST) Date: Tue, 22 Feb 2022 21:22:14 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-39-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 38/47] mm: asi: ASI annotation support for dynamic modules. 
From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: Ofir Weisse , kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Ofir Weisse Adding support for use of ASI static variable annotations in dynamic modules: - __asi_not_sensitive and - __asi_not_sensitive_readmostly Per module, we now have the following offsets: 1. asi_section_offset/size - which should be mapped into asi global pool 2. asi_readmostly_section/size - same as above, for read mostly data; 3. once_section_offset/size - is considered asi non-sensitive Signed-off-by: Ofir Weisse --- arch/x86/include/asm/asi.h | 3 ++ arch/x86/mm/asi.c | 66 ++++++++++++++++++++++++++++++++++++++ include/asm-generic/asi.h | 3 ++ include/linux/module.h | 9 ++++++ kernel/module.c | 58 +++++++++++++++++++++++++++++++++ 5 files changed, 139 insertions(+) diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index 6dd9c7c8a2b8..d43f6aadffee 100644 --- a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -98,6 +98,9 @@ static inline void asi_init_thread_state(struct thread_struct *thread) thread->intr_nest_depth = 0; } +int asi_load_module(struct module* module); +void asi_unload_module(struct module* module); + static inline void asi_set_target_unrestricted(void) { if (static_cpu_has(X86_FEATURE_ASI)) { diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index 9b1bd005f343..6c14aa1fc4aa 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -5,6 +5,7 @@ #include #include #include +#include #include #include @@ -308,6 +309,71 @@ static int __init set_asi_param(char *str) } early_param("asi", set_asi_param); +/* asi_load_module() is called from layout_and_allocate() in kernel/module.c + * We map the module and its data in init_mm.asi_pgd[0]. 
+*/ +int asi_load_module(struct module* module) +{ + int err = 0; + + /* Map the code/text */ + err = asi_map(ASI_GLOBAL_NONSENSITIVE, + module->core_layout.base, + module->core_layout.ro_after_init_size ); + if (err) + return err; + + /* Map global variables annotated as non-sensitive for ASI */ + err = asi_map(ASI_GLOBAL_NONSENSITIVE, + module->core_layout.base + + module->core_layout.asi_section_offset, + module->core_layout.asi_section_size ); + if (err) + return err; + + /* Map read-mostly global variables annotated as non-sensitive for ASI */ + err = asi_map(ASI_GLOBAL_NONSENSITIVE, + module->core_layout.base + + module->core_layout.asi_readmostly_section_offset, + module->core_layout.asi_readmostly_section_size); + if (err) + return err; + + /* Map .data.once section as well */ + err = asi_map(ASI_GLOBAL_NONSENSITIVE, + module->core_layout.base + + module->core_layout.once_section_offset, + module->core_layout.once_section_size ); + if (err) + return err; + + return 0; +} +EXPORT_SYMBOL_GPL(asi_load_module); + +void asi_unload_module(struct module* module) +{ + asi_unmap(ASI_GLOBAL_NONSENSITIVE, + module->core_layout.base, + module->core_layout.ro_after_init_size, true); + + asi_unmap(ASI_GLOBAL_NONSENSITIVE, + module->core_layout.base + + module->core_layout.asi_section_offset, + module->core_layout.asi_section_size, true); + + asi_unmap(ASI_GLOBAL_NONSENSITIVE, + module->core_layout.base + + module->core_layout.asi_readmostly_section_offset, + module->core_layout.asi_readmostly_section_size, true); + + asi_unmap(ASI_GLOBAL_NONSENSITIVE, + module->core_layout.base + + module->core_layout.once_section_offset, + module->core_layout.once_section_size, true); + +} + static int __init asi_global_init(void) { uint i, n; diff --git a/include/asm-generic/asi.h b/include/asm-generic/asi.h index d9082267a5dd..2763cb1a974c 100644 --- a/include/asm-generic/asi.h +++ b/include/asm-generic/asi.h @@ -120,6 +120,7 @@ void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len) { } #define static_asi_enabled() false +static inline int asi_load_module(struct module* module) {return 0;} /* IMPORTANT: Any modification to the name here should also be applied to * include/asm-generic/vmlinux.lds.h */ @@ -127,6 +128,8 @@ void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len) { } #define __asi_not_sensitive #define __asi_not_sensitive_readmostly +static inline void asi_unload_module(struct module* module) { } + #endif /* !_ASSEMBLY_ */ #endif /* !CONFIG_ADDRESS_SPACE_ISOLATION */ diff --git a/include/linux/module.h b/include/linux/module.h index c9f1200b2312..82267a95f936 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -336,6 +336,15 @@ struct module_layout { #ifdef CONFIG_MODULES_TREE_LOOKUP struct mod_tree_node mtn; #endif + +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + unsigned int asi_section_offset; + unsigned int asi_section_size; + unsigned int asi_readmostly_section_offset; + unsigned int asi_readmostly_section_size; + unsigned int once_section_offset; + unsigned int once_section_size; +#endif }; #ifdef CONFIG_MODULES_TREE_LOOKUP diff --git a/kernel/module.c b/kernel/module.c index 84a9141a5e15..d363b8a0ee24 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -2159,6 +2159,8 @@ static void free_module(struct module *mod) { trace_module_free(mod); + asi_unload_module(mod); + mod_sysfs_teardown(mod); /* @@ -2416,6 +2418,31 @@ static bool module_init_layout_section(const char *sname) return module_init_section(sname); } +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +static void
asi_record_sections_layout(struct module *mod, + const char *sname, + Elf_Shdr *s) +{ + if (strstarts(sname, ASI_NON_SENSITIVE_READ_MOSTLY_SECTION_NAME)) { + mod->core_layout.asi_readmostly_section_offset = s->sh_entsize; + mod->core_layout.asi_readmostly_section_size = s->sh_size; + } + else if (strstarts(sname, ASI_NON_SENSITIVE_SECTION_NAME)) { + mod->core_layout.asi_section_offset = s->sh_entsize; + mod->core_layout.asi_section_size = s->sh_size; + } + if (strstarts(sname, ".data.once")) { + mod->core_layout.once_section_offset = s->sh_entsize; + mod->core_layout.once_section_size = s->sh_size; + } +} +#else +static void asi_record_sections_layout(struct module *mod, + const char *sname, + Elf_Shdr *s) +{} +#endif + /* * Lay out the SHF_ALLOC sections in a way not dissimilar to how ld * might -- code, read-only data, read-write data, small data. Tally @@ -2453,6 +2480,7 @@ static void layout_sections(struct module *mod, struct load_info *info) || module_init_layout_section(sname)) continue; s->sh_entsize = get_offset(mod, &mod->core_layout.size, s, i); + asi_record_sections_layout(mod, sname, s); pr_debug("\t%s\n", sname); } switch (m) { @@ -3558,6 +3586,25 @@ static bool blacklisted(const char *module_name) } core_param(module_blacklist, module_blacklist, charp, 0400); +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +static void asi_fix_section_size_and_alignment(struct load_info *info, + char *section_to_fix) +{ + unsigned int ndx = find_sec(info, section_to_fix ); + if (!ndx) + return; + + info->sechdrs[ndx].sh_addralign = PAGE_SIZE; + info->sechdrs[ndx].sh_size = + ALIGN( info->sechdrs[ndx].sh_size, PAGE_SIZE ); +} +#else +static inline void asi_fix_section_size_and_alignment(struct load_info *info, + char *section_to_fix) +{} +#endif + + static struct module *layout_and_allocate(struct load_info *info, int flags) { struct module *mod; @@ -3600,6 +3647,15 @@ static struct module *layout_and_allocate(struct load_info *info, int flags) if (ndx) info->sechdrs[ndx].sh_flags |= SHF_RO_AFTER_INIT; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + /* These are sections we will want to map into an ASI page-table. We + * therefore need these sections to be aligned to a PAGE_SIZE */ + asi_fix_section_size_and_alignment(info, ASI_NON_SENSITIVE_SECTION_NAME); + asi_fix_section_size_and_alignment(info, + ASI_NON_SENSITIVE_READ_MOSTLY_SECTION_NAME); + asi_fix_section_size_and_alignment(info, ".data.once"); +#endif + /* * Determine total sizes, and put offsets in sh_entsize. For now * this is done generically; there doesn't appear to be any @@ -4127,6 +4183,8 @@ static int load_module(struct load_info *info, const char __user *uargs, /* Get rid of temporary copy. */ free_copy(info); + asi_load_module(mod); + /* Done! 
*/ trace_module_load(mod); From patchwork Wed Feb 23 05:22:15 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756440 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D16CFC433FE for ; Wed, 23 Feb 2022 05:27:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238282AbiBWF1c (ORCPT ); Wed, 23 Feb 2022 00:27:32 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58086 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238418AbiBWF0y (ORCPT ); Wed, 23 Feb 2022 00:26:54 -0500 Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 60BBC6BDE2 for ; Tue, 22 Feb 2022 21:25:22 -0800 (PST) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-2d07ae1145aso162074087b3.4 for ; Tue, 22 Feb 2022 21:25:22 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=Kqez4sz2gXA92s/yKdI0Sadgd/VKyd58nC75f61sH3c=; b=kY6HPVZ5+V/4ADSj5gTm7IQPleGSUbrDBXkArXIMxRp8wv39Fiygf+bF+YqLodOTpp AZsLSr13DzroNgqyc8IDfXzcQUL9DyFnAH4zuxHj+73S9iVBPAI2bhK64r7+IZ9iPw2j l6U45p4hnLVXIe3HFpHe4qJp9mEpOsCgXDx6xEMFZlJGN4bxSQMLkshnkvHzcWhTYszS 3RalZYOgTpQ3xR1RbA6oUKLNH9FzKB/Oq6geEXx6xglDHFlrVYGZZaZlqsNdHMvuOFaS qOuzTGhwZjX65TLhLW2uhuxlSD1GB251cNcp9nHh4Mb+0b/OHXDC3ll2brbPttKVPhId WuIg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=Kqez4sz2gXA92s/yKdI0Sadgd/VKyd58nC75f61sH3c=; b=0IHfklyOYvERuCvKcrLwM5tSWU6tdZYszVJ34LXb6edAkDn+JwcxMYK5AThY17sc+j ZwnscXl1gsBI/dvbKCHgITruDUdE4Fwr7l8wqyoQMjtluB27T1M9CZhq6Hek8UHZRyhL gq2rCGgXLB3ahCSjEBU5UMEq7kkKoS8jy9gniv3BXA1OIXthLmnsMSYfbbbdlE76lgzo 7RMSkFTHA1hItco8uy53BZgHHJd1H4Zg5AeZRN7ChDUaWggRrPhQ2iYnZMNJLnBRW3Kr I+7Z7/IA9+5FthWGdXIgOcFa/hVpriRncslDBIVsPHKrwv5PSL6BCgkvIsol8qAXeT20 E3+Q== X-Gm-Message-State: AOAM530UUyCQDUWyPqJmSnUuGi2XN01n01SXrHAlkGtmBwPoRbhh6POR RegQvYXg0Cl4i4mVCkqL9zdRU4okp9Q/ X-Google-Smtp-Source: ABdhPJwRpDGu9uCaO7SUVy/qucopgnc8bRss4/HLiRU4Az2YJmA6Oz65Xuo4RY/0idM9yle+QHSTEPNKFl9M X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a25:34c9:0:b0:623:fc5f:b98 with SMTP id b192-20020a2534c9000000b00623fc5f0b98mr27190113yba.195.1645593912355; Tue, 22 Feb 2022 21:25:12 -0800 (PST) Date: Tue, 22 Feb 2022 21:22:15 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-40-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 39/47] mm: asi: Skip conventional L1TF/MDS mitigations From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: Ofir Weisse , kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org 
From: Ofir Weisse If ASI is enabled for an mm, then the L1D flushes and MDS mitigations will be taken care of ASI. We check if asi is enabled by checking current->mm->asi_enabled. To use ASI, a cgroup flag must be set before the VM process is forked - causing a flag mm->asi_enabled to be set. Signed-off-by: Ofir Weisse --- arch/x86/kvm/vmx/vmx.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index e0178b57be75..6549fef39f2b 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6609,7 +6609,11 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu, kvm_guest_enter_irqoff(); - vmx_flush_sensitive_cpu_state(vcpu); + /* If Address Space Isolation is enabled, it will take care of L1D + * flushes, and will also mitigate MDS. In other words, if no ASI - + * flush sensitive cpu state. */ + if (!static_asi_enabled() || !mm_asi_enabled(current->mm)) + vmx_flush_sensitive_cpu_state(vcpu); asi_enter(vcpu->kvm->asi); From patchwork Wed Feb 23 05:22:16 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756439 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 83953C433F5 for ; Wed, 23 Feb 2022 05:27:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238361AbiBWF1b (ORCPT ); Wed, 23 Feb 2022 00:27:31 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56900 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238369AbiBWF1Q (ORCPT ); Wed, 23 Feb 2022 00:27:16 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B8A8D6C1F1 for ; Tue, 22 Feb 2022 21:25:23 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id b12-20020a056902030c00b0061d720e274aso26588090ybs.20 for ; Tue, 22 Feb 2022 21:25:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=AJ9oF4MxpZqgNWeQmqV576xo41knbB79umnOaNmXddY=; b=gDjBd/x5dYNgFpVj+6AXYinaFyQ0y0nY9xCgZVbh0gj9MDv8hxw/demmgSf5+q0lIl f5u9TMyC094RJt+uGxErIIL6jEyzHxyof4Xb7WeyC6uYfiR8WIq5qp+TlkXog/JDvZk7 2mtgyW1bRYWB4o+N58ElEO1CyDNRbfSatixMLbQR1FnYuk0Gfj5u+Ct8tTOprcSTTI3D O9Kk/stLoRds2nPZbzZ/5aJHis4MCrOVxHj7FYqsiO1KWp6tTJxhsj5RQHjcB5DUXElF yIJbJ9KqvnvVAuRlFRGpGXt2ubrzQb0wSafap3WMiTiEiFUzbC7xRedGoDqwKzVo3u9S eXvA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=AJ9oF4MxpZqgNWeQmqV576xo41knbB79umnOaNmXddY=; b=3Kves+PRFfbF06kf4QiDJwpFGuoGKjrLcwsn1yJZnz37bV9CfgykwZ3xilX9fxGcwV Isag8D3yrAdhmjxqp/KSkxsVKlJ7J8JI63b1IqDJwv5QnnfAl253b86kJIL8FAFXYgO4 bdXBPkK8SJ/BoSZvgsbNOTsuIaUiKxgnnJdqe8YWJw/8hWLT/FgIRmquZT8aRbNZWpqA RMfLGhYCHwJoqqUO/Yx6Mr9LyLCrRLV+OUoUQO961olwRJdNay1e3D9GsQwc83jYE0+T 8J6OFsrf3+D79fisfQb9aYTqnqlIpUNGiPoUYOnZYKIDqOaf5Z9DxAHLSxFAi7d0mQOf 5kEg== X-Gm-Message-State: AOAM530gZWmT0Gkzwjyaf1alU0RlUCJzlh0C3MnQinqByrqQkmbPGRnH K8xOqSE5Y0bXbazMyoPuUZ3GC38oRZyb X-Google-Smtp-Source: ABdhPJxXcU94ggDYk55LTym0qr5/ST37jyZ7dcNnih2o73ZZ8KtzSOHKd++I3ohJEg/v0wMLNTTLtdFeGJBD 
X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a81:3dcb:0:b0:2ca:72dd:904c with SMTP id k194-20020a813dcb000000b002ca72dd904cmr28454275ywa.290.1645593914646; Tue, 22 Feb 2022 21:25:14 -0800 (PST) Date: Tue, 22 Feb 2022 21:22:16 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-41-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 40/47] mm: asi: support for static percpu DEFINE_PER_CPU*_ASI From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: Ofir Weisse , kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Ofir Weisse Implemented the following PERCPU static declarations: - DECLARE/DEFINE_PER_CPU_ASI_NOT_SENSITIVE - DECLARE/DEFINE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE - DECLARE/DEFINE_PER_CPU_ALIGNED_ASI_NOT_SENSITIVE - DECLARE/DEFINE_PER_CPU_PAGE_ALIGNED_ASI_NOT_SENSITIVE These definitions are also supported in dynamic modules. To support percpu variables in dynamic modules, we're creating an ASI pcpu reserved chunk. The reserved size PERCPU_MODULE_RESERVE is now split between the normal reserved chunk and the ASI one. Signed-off-by: Ofir Weisse --- arch/x86/mm/asi.c | 39 +++++++- include/asm-generic/percpu.h | 6 ++ include/asm-generic/vmlinux.lds.h | 5 + include/linux/module.h | 6 ++ include/linux/percpu-defs.h | 39 ++++++++ include/linux/percpu.h | 8 +- kernel/module-internal.h | 1 + kernel/module.c | 154 ++++++++++++++++++++++++++---- mm/percpu.c | 134 ++++++++++++++++++++++---- 9 files changed, 356 insertions(+), 36 deletions(-) diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index 6c14aa1fc4aa..ba373b461855 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -309,6 +309,32 @@ static int __init set_asi_param(char *str) } early_param("asi", set_asi_param); +static int asi_map_percpu(struct asi *asi, void *percpu_addr, size_t len) +{ + int cpu, err; + void *ptr; + + for_each_possible_cpu(cpu) { + ptr = per_cpu_ptr(percpu_addr, cpu); + err = asi_map(asi, ptr, len); + if (err) + return err; + } + + return 0; +} + +static void asi_unmap_percpu(struct asi *asi, void *percpu_addr, size_t len) +{ + int cpu; + void *ptr; + + for_each_possible_cpu(cpu) { + ptr = per_cpu_ptr(percpu_addr, cpu); + asi_unmap(asi, ptr, len, true); + } +} + /* asi_load_module() is called from layout_and_allocate() in kernel/module.c * We map the module and its data in init_mm.asi_pgd[0]. 
*/ @@ -347,7 +373,13 @@ int asi_load_module(struct module* module) if (err) return err; - return 0; + err = asi_map_percpu(ASI_GLOBAL_NONSENSITIVE, + module->percpu_asi, + module->percpu_asi_size ); + if (err) + return err; + + return 0; } EXPORT_SYMBOL_GPL(asi_load_module); @@ -372,6 +404,9 @@ void asi_unload_module(struct module* module) module->core_layout.once_section_offset, module->core_layout.once_section_size, true); + asi_unmap_percpu(ASI_GLOBAL_NONSENSITIVE, module->percpu_asi, + module->percpu_asi_size); + } static int __init asi_global_init(void) @@ -399,6 +434,8 @@ static int __init asi_global_init(void) static_branch_enable(&asi_local_map_initialized); + pcpu_map_asi_reserved_chunk(); + return 0; } subsys_initcall(asi_global_init) diff --git a/include/asm-generic/percpu.h b/include/asm-generic/percpu.h index 6432a7fade91..40001b74114f 100644 --- a/include/asm-generic/percpu.h +++ b/include/asm-generic/percpu.h @@ -50,6 +50,12 @@ extern void setup_per_cpu_areas(void); #endif /* SMP */ +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +void __init pcpu_map_asi_reserved_chunk(void); +#else +static inline void pcpu_map_asi_reserved_chunk(void) {} +#endif + #ifndef PER_CPU_BASE_SECTION #ifdef CONFIG_SMP #define PER_CPU_BASE_SECTION ".data..percpu" diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h index c769d939c15f..0a931aedc285 100644 --- a/include/asm-generic/vmlinux.lds.h +++ b/include/asm-generic/vmlinux.lds.h @@ -1080,6 +1080,11 @@ . = ALIGN(cacheline); \ *(.data..percpu) \ *(.data..percpu..shared_aligned) \ + . = ALIGN(PAGE_SIZE); \ + __per_cpu_asi_start = .; \ + *(.data..percpu..asi_non_sensitive) \ + . = ALIGN(PAGE_SIZE); \ + __per_cpu_asi_end = .; \ PERCPU_DECRYPTED_SECTION \ __per_cpu_end = .; diff --git a/include/linux/module.h b/include/linux/module.h index 82267a95f936..d4d020bae171 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -463,6 +463,12 @@ struct module { /* Per-cpu data. 
*/ void __percpu *percpu; unsigned int percpu_size; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + /* Per-cpu data for ASI */ + void __percpu *percpu_asi; + unsigned int percpu_asi_size; +#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */ + #endif void *noinstr_text_start; unsigned int noinstr_text_size; diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h index af1071535de8..5d9fdc93e0fa 100644 --- a/include/linux/percpu-defs.h +++ b/include/linux/percpu-defs.h @@ -170,6 +170,45 @@ #define DEFINE_PER_CPU_READ_MOSTLY(type, name) \ DEFINE_PER_CPU_SECTION(type, name, "..read_mostly") +/* + * Declaration/definition used for per-CPU variables which for the sake for + * address space isolation (ASI) are deemed not sensitive + */ +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +#define ASI_PERCPU_SECTION "..asi_non_sensitive" +#else +#define ASI_PERCPU_SECTION "" +#endif + +#define DECLARE_PER_CPU_ASI_NOT_SENSITIVE(type, name) \ + DECLARE_PER_CPU_SECTION(type, name, ASI_PERCPU_SECTION) + +#define DECLARE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(type, name) \ + DECLARE_PER_CPU_SECTION(type, name, ASI_PERCPU_SECTION) \ + ____cacheline_aligned_in_smp + +#define DECLARE_PER_CPU_ALIGNED_ASI_NOT_SENSITIVE(type, name) \ + DECLARE_PER_CPU_SECTION(type, name, ASI_PERCPU_SECTION) \ + ____cacheline_aligned + +#define DECLARE_PER_CPU_PAGE_ALIGNED_ASI_NOT_SENSITIVE(type, name) \ + DECLARE_PER_CPU_SECTION(type, name, ASI_PERCPU_SECTION) \ + __aligned(PAGE_SIZE) + +#define DEFINE_PER_CPU_ASI_NOT_SENSITIVE(type, name) \ + DEFINE_PER_CPU_SECTION(type, name, ASI_PERCPU_SECTION) + +#define DEFINE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(type, name) \ + DEFINE_PER_CPU_SECTION(type, name, ASI_PERCPU_SECTION) \ + ____cacheline_aligned_in_smp + +#define DEFINE_PER_CPU_ALIGNED_ASI_NOT_SENSITIVE(type, name) \ + DEFINE_PER_CPU_SECTION(type, name, ASI_PERCPU_SECTION) \ + ____cacheline_aligned + +#define DEFINE_PER_CPU_PAGE_ALIGNED_ASI_NOT_SENSITIVE(type, name) \ + DEFINE_PER_CPU_SECTION(type, name, ASI_PERCPU_SECTION) \ + __aligned(PAGE_SIZE) /* * Declaration/definition used for per-CPU variables that should be accessed diff --git a/include/linux/percpu.h b/include/linux/percpu.h index ae4004e7957e..a2cc4c32cabd 100644 --- a/include/linux/percpu.h +++ b/include/linux/percpu.h @@ -13,7 +13,8 @@ /* enough to cover all DEFINE_PER_CPUs in modules */ #ifdef CONFIG_MODULES -#define PERCPU_MODULE_RESERVE (8 << 10) +/* #define PERCPU_MODULE_RESERVE (8 << 10) */ +#define PERCPU_MODULE_RESERVE (16 << 10) #else #define PERCPU_MODULE_RESERVE 0 #endif @@ -123,6 +124,11 @@ extern int __init pcpu_page_first_chunk(size_t reserved_size, #endif extern void __percpu *__alloc_reserved_percpu(size_t size, size_t align) __alloc_size(1); + +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +extern void __percpu *__alloc_reserved_percpu_asi(size_t size, size_t align); +#endif + extern bool __is_kernel_percpu_address(unsigned long addr, unsigned long *can_addr); extern bool is_kernel_percpu_address(unsigned long addr); diff --git a/kernel/module-internal.h b/kernel/module-internal.h index 33783abc377b..44c05ae06b2c 100644 --- a/kernel/module-internal.h +++ b/kernel/module-internal.h @@ -25,6 +25,7 @@ struct load_info { #endif struct { unsigned int sym, str, mod, vers, info, pcpu; + unsigned int pcpu_asi; } index; }; diff --git a/kernel/module.c b/kernel/module.c index d363b8a0ee24..0048b7843903 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -587,6 +587,13 @@ static inline void __percpu *mod_percpu(struct module *mod) return mod->percpu; } +#ifdef 
CONFIG_ADDRESS_SPACE_ISOLATION +static inline void __percpu *mod_percpu_asi(struct module *mod) +{ + return mod->percpu_asi; +} +#endif + static int percpu_modalloc(struct module *mod, struct load_info *info) { Elf_Shdr *pcpusec = &info->sechdrs[info->index.pcpu]; @@ -611,9 +618,34 @@ static int percpu_modalloc(struct module *mod, struct load_info *info) return 0; } +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +static int percpu_asi_modalloc(struct module *mod, struct load_info *info) +{ + Elf_Shdr *pcpusec = &info->sechdrs[info->index.pcpu_asi]; + unsigned long align = pcpusec->sh_addralign; + + if ( !pcpusec->sh_size) + return 0; + + mod->percpu_asi = __alloc_reserved_percpu_asi(pcpusec->sh_size, align); + if (!mod->percpu_asi) { + pr_warn("%s: Could not allocate %lu bytes percpu data\n", + mod->name, (unsigned long)pcpusec->sh_size); + return -ENOMEM; + } + mod->percpu_asi_size = pcpusec->sh_size; + + return 0; +} +#endif + static void percpu_modfree(struct module *mod) { free_percpu(mod->percpu); + +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + free_percpu(mod->percpu_asi); +#endif } static unsigned int find_pcpusec(struct load_info *info) @@ -621,6 +653,13 @@ static unsigned int find_pcpusec(struct load_info *info) return find_sec(info, ".data..percpu"); } +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +static unsigned int find_pcpusec_asi(struct load_info *info) +{ + return find_sec(info, ".data..percpu" ASI_PERCPU_SECTION ); +} +#endif + static void percpu_modcopy(struct module *mod, const void *from, unsigned long size) { @@ -630,6 +669,39 @@ static void percpu_modcopy(struct module *mod, memcpy(per_cpu_ptr(mod->percpu, cpu), from, size); } +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +static void percpu_asi_modcopy(struct module *mod, + const void *from, unsigned long size) +{ + int cpu; + + for_each_possible_cpu(cpu) + memcpy(per_cpu_ptr(mod->percpu_asi, cpu), from, size); +} +#endif + +bool __is_module_percpu_address_helper(unsigned long addr, + unsigned long *can_addr, + unsigned int cpu, + void* percpu_start, + unsigned int percpu_size) +{ + void *start = per_cpu_ptr(percpu_start, cpu); + void *va = (void *)addr; + + if (va >= start && va < start + percpu_size) { + if (can_addr) { + *can_addr = (unsigned long) (va - start); + *can_addr += (unsigned long) + per_cpu_ptr(percpu_start, + get_boot_cpu_id()); + } + return true; + } + + return false; +} + bool __is_module_percpu_address(unsigned long addr, unsigned long *can_addr) { struct module *mod; @@ -640,22 +712,34 @@ bool __is_module_percpu_address(unsigned long addr, unsigned long *can_addr) list_for_each_entry_rcu(mod, &modules, list) { if (mod->state == MODULE_STATE_UNFORMED) continue; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + if (!mod->percpu_size && !mod->percpu_asi_size) + continue; +#else if (!mod->percpu_size) continue; +#endif for_each_possible_cpu(cpu) { - void *start = per_cpu_ptr(mod->percpu, cpu); - void *va = (void *)addr; - - if (va >= start && va < start + mod->percpu_size) { - if (can_addr) { - *can_addr = (unsigned long) (va - start); - *can_addr += (unsigned long) - per_cpu_ptr(mod->percpu, - get_boot_cpu_id()); - } + if (__is_module_percpu_address_helper(addr, + can_addr, + cpu, + mod->percpu, + mod->percpu_size)) { preempt_enable(); return true; - } + } + +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + if (__is_module_percpu_address_helper( + addr, + can_addr, + cpu, + mod->percpu_asi, + mod->percpu_asi_size)) { + preempt_enable(); + return true; + } +#endif } } @@ -2344,6 +2428,10 @@ static int simplify_symbols(struct module *mod, const 
struct load_info *info) /* Divert to percpu allocation if a percpu var. */ if (sym[i].st_shndx == info->index.pcpu) secbase = (unsigned long)mod_percpu(mod); +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + else if (sym[i].st_shndx == info->index.pcpu_asi) + secbase = (unsigned long)mod_percpu_asi(mod); +#endif else secbase = info->sechdrs[sym[i].st_shndx].sh_addr; sym[i].st_value += secbase; @@ -2664,6 +2752,10 @@ static char elf_type(const Elf_Sym *sym, const struct load_info *info) return 'U'; if (sym->st_shndx == SHN_ABS || sym->st_shndx == info->index.pcpu) return 'a'; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + if (sym->st_shndx == info->index.pcpu_asi) + return 'a'; +#endif if (sym->st_shndx >= SHN_LORESERVE) return '?'; if (sechdrs[sym->st_shndx].sh_flags & SHF_EXECINSTR) @@ -2691,7 +2783,8 @@ static char elf_type(const Elf_Sym *sym, const struct load_info *info) } static bool is_core_symbol(const Elf_Sym *src, const Elf_Shdr *sechdrs, - unsigned int shnum, unsigned int pcpundx) + unsigned int shnum, unsigned int pcpundx, + unsigned pcpu_asi_ndx) { const Elf_Shdr *sec; @@ -2701,7 +2794,7 @@ static bool is_core_symbol(const Elf_Sym *src, const Elf_Shdr *sechdrs, return false; #ifdef CONFIG_KALLSYMS_ALL - if (src->st_shndx == pcpundx) + if (src->st_shndx == pcpundx || src->st_shndx == pcpu_asi_ndx ) return true; #endif @@ -2743,7 +2836,7 @@ static void layout_symtab(struct module *mod, struct load_info *info) for (ndst = i = 0; i < nsrc; i++) { if (i == 0 || is_livepatch_module(mod) || is_core_symbol(src+i, info->sechdrs, info->hdr->e_shnum, - info->index.pcpu)) { + info->index.pcpu, info->index.pcpu_asi)) { strtab_size += strlen(&info->strtab[src[i].st_name])+1; ndst++; } @@ -2807,7 +2900,7 @@ static void add_kallsyms(struct module *mod, const struct load_info *info) mod->kallsyms->typetab[i] = elf_type(src + i, info); if (i == 0 || is_livepatch_module(mod) || is_core_symbol(src+i, info->sechdrs, info->hdr->e_shnum, - info->index.pcpu)) { + info->index.pcpu, info->index.pcpu_asi)) { mod->core_kallsyms.typetab[ndst] = mod->kallsyms->typetab[i]; dst[ndst] = src[i]; @@ -3289,6 +3382,12 @@ static int setup_load_info(struct load_info *info, int flags) info->index.pcpu = find_pcpusec(info); +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + info->index.pcpu_asi = find_pcpusec_asi(info); +#else + info->index.pcpu_asi = 0; +#endif + return 0; } @@ -3629,6 +3728,12 @@ static struct module *layout_and_allocate(struct load_info *info, int flags) /* We will do a special allocation for per-cpu sections later. */ info->sechdrs[info->index.pcpu].sh_flags &= ~(unsigned long)SHF_ALLOC; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + if (info->index.pcpu_asi) + info->sechdrs[info->index.pcpu_asi].sh_flags &= + ~(unsigned long)SHF_ALLOC; +#endif + /* * Mark ro_after_init section with SHF_RO_AFTER_INIT so that * layout_sections() can put it in the right place. @@ -3700,6 +3805,14 @@ static int post_relocation(struct module *mod, const struct load_info *info) percpu_modcopy(mod, (void *)info->sechdrs[info->index.pcpu].sh_addr, info->sechdrs[info->index.pcpu].sh_size); +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + /* Copy relocated percpu ASI area over. */ + percpu_asi_modcopy( + mod, + (void *)info->sechdrs[info->index.pcpu_asi].sh_addr, + info->sechdrs[info->index.pcpu_asi].sh_size); +#endif + /* Setup kallsyms-specific fields. 
*/ add_kallsyms(mod, info); @@ -4094,6 +4207,11 @@ static int load_module(struct load_info *info, const char __user *uargs, err = percpu_modalloc(mod, info); if (err) goto unlink_mod; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + err = percpu_asi_modalloc(mod, info); + if (err) + goto unlink_mod; +#endif /* Now module is in final location, initialize linked lists, etc. */ err = module_unload_init(mod); @@ -4183,7 +4301,11 @@ static int load_module(struct load_info *info, const char __user *uargs, /* Get rid of temporary copy. */ free_copy(info); - asi_load_module(mod); + err = asi_load_module(mod); + /* If the ASI loading failed, it doesn't necessarily mean that the + * module loading failed. We print an error and move on. */ + if (err) + pr_err("ASI: failed loading module %s", mod->name); /* Done! */ trace_module_load(mod); diff --git a/mm/percpu.c b/mm/percpu.c index beaca5adf9d4..3665a5ea71ec 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -169,6 +169,10 @@ struct pcpu_chunk *pcpu_first_chunk __ro_after_init; */ struct pcpu_chunk *pcpu_reserved_chunk __ro_after_init; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +struct pcpu_chunk *pcpu_reserved_nonsensitive_chunk __ro_after_init; +#endif + DEFINE_SPINLOCK(pcpu_lock); /* all internal data structures */ static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop, map ext */ @@ -1621,6 +1625,11 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr) if (pcpu_addr_in_chunk(pcpu_first_chunk, addr)) return pcpu_first_chunk; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + /* is it in the reserved ASI region? */ + if (pcpu_addr_in_chunk(pcpu_reserved_nonsensitive_chunk, addr)) + return pcpu_reserved_nonsensitive_chunk; +#endif /* is it in the reserved region? */ if (pcpu_addr_in_chunk(pcpu_reserved_chunk, addr)) return pcpu_reserved_chunk; @@ -1805,23 +1814,37 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, spin_lock_irqsave(&pcpu_lock, flags); +#define TRY_ALLOC_FROM_CHUNK(source_chunk, chunk_name) \ +do { \ + if (!source_chunk) { \ + err = chunk_name " chunk not allocated"; \ + goto fail_unlock; \ + } \ + chunk = source_chunk; \ + \ + off = pcpu_find_block_fit(chunk, bits, bit_align, is_atomic); \ + if (off < 0) { \ + err = "alloc from " chunk_name " chunk failed"; \ + goto fail_unlock; \ + } \ + \ + off = pcpu_alloc_area(chunk, bits, bit_align, off); \ + if (off >= 0) \ + goto area_found; \ + \ + err = "alloc from " chunk_name " chunk failed"; \ + goto fail_unlock; \ +} while(0) + /* serve reserved allocations from the reserved chunk if available */ - if (reserved && pcpu_reserved_chunk) { - chunk = pcpu_reserved_chunk; - - off = pcpu_find_block_fit(chunk, bits, bit_align, is_atomic); - if (off < 0) { - err = "alloc from reserved chunk failed"; - goto fail_unlock; - } - - off = pcpu_alloc_area(chunk, bits, bit_align, off); - if (off >= 0) - goto area_found; - - err = "alloc from reserved chunk failed"; - goto fail_unlock; - } +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + if (reserved && (gfp & __GFP_GLOBAL_NONSENSITIVE)) + TRY_ALLOC_FROM_CHUNK(pcpu_reserved_nonsensitive_chunk, + "reserverved ASI"); + else +#endif + if (reserved && pcpu_reserved_chunk) + TRY_ALLOC_FROM_CHUNK(pcpu_reserved_chunk, "reserved"); restart: /* search through normal chunks */ @@ -1998,6 +2021,14 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align) return pcpu_alloc(size, align, true, GFP_KERNEL); } +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +void __percpu *__alloc_reserved_percpu_asi(size_t size, size_t align) +{ + return 
pcpu_alloc(size, align, true, + GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE); +} +#endif + /** * pcpu_balance_free - manage the amount of free chunks * @empty_only: free chunks only if there are no populated pages @@ -2838,15 +2869,46 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, * the dynamic region. */ tmp_addr = (unsigned long)base_addr + static_size; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + /* If ASI is used, split the reserved size between the nonsensitive + * chunk and the normal chunk evenly. */ + map_size = (ai->reserved_size / 2) ?: dyn_size; +#else map_size = ai->reserved_size ?: dyn_size; +#endif chunk = pcpu_alloc_first_chunk(tmp_addr, map_size); /* init dynamic chunk if necessary */ if (ai->reserved_size) { - pcpu_reserved_chunk = chunk; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + /* TODO: check if ASI was enabled via boot param or static branch */ + /* We allocated pcpu_reserved_nonsensitive_chunk only if + * pcpu_reserved_chunk is used as well. */ + pcpu_reserved_nonsensitive_chunk = chunk; + pcpu_reserved_nonsensitive_chunk->is_asi_nonsensitive = true; + /* We used the previous chunk as pcpu_reserved_nonsensitive_chunk. Now + * allocate pcpu_reserved_chunk */ + tmp_addr = (unsigned long)base_addr + static_size + + (ai->reserved_size / 2); + map_size = ai->reserved_size / 2; + chunk = pcpu_alloc_first_chunk(tmp_addr, map_size); +#endif + /* Whether ASI is enabled or disabled, the end result is the + * same: + * If ASI is enabled, tmp_addr, used for pcpu_first_chunk should + * be after + * 1. pcpu_reserved_nonsensitive_chunk AND + * 2. pcpu_reserved_chunk + * Since we split the reserve size in half, we skip in total the + * whole ai->reserved_size. + * If ASI is disabled, tmp_addr, used for pcpu_first_chunk is + * just after pcpu_reserved_chunk */ tmp_addr = (unsigned long)base_addr + static_size + ai->reserved_size; + + pcpu_reserved_chunk = chunk; + map_size = dyn_size; chunk = pcpu_alloc_first_chunk(tmp_addr, map_size); } @@ -3129,7 +3191,6 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size, cpu_distance_fn); if (IS_ERR(ai)) return PTR_ERR(ai); - size_sum = ai->static_size + ai->reserved_size + ai->dyn_size; areas_size = PFN_ALIGN(ai->nr_groups * sizeof(void *)); @@ -3460,3 +3521,40 @@ static int __init percpu_enable_async(void) return 0; } subsys_initcall(percpu_enable_async); + +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +void __init pcpu_map_asi_reserved_chunk(void) +{ + void *start_addr, *end_addr; + unsigned long map_start_addr, map_end_addr; + struct pcpu_chunk *chunk = pcpu_reserved_nonsensitive_chunk; + int err = 0; + + if (!chunk) + return; + + start_addr = chunk->base_addr + chunk->start_offset; + end_addr = chunk->base_addr + chunk->nr_pages * PAGE_SIZE - + chunk->end_offset; + + + /* No need in asi_map_percpu, since these addresses are "real". The + * chunk has full pages allocated, so we're not worried about leakage of + * data caused by start_addr-->end_addr not being page aligned. asi_map, + * however, will fail/crash if the addresses are not aligned. */ + map_start_addr = (unsigned long)start_addr & PAGE_MASK; + map_end_addr = PAGE_ALIGN((unsigned long)end_addr); + + pr_err("%s:%d mapping 0x%lx --> 0x%lx", + __FUNCTION__, __LINE__, map_start_addr, map_end_addr); + err = asi_map(ASI_GLOBAL_NONSENSITIVE, + (void*)map_start_addr, map_end_addr - map_start_addr); + + WARN(err, "Failed mapping percpu reserved chunk into ASI"); + + /* If we couldn't map the chuknk into ASI, it is useless. 
Set the chunk + * to NULL, so allocations from it will fail. */ + if (err) + pcpu_reserved_nonsensitive_chunk = NULL; +} +#endif From patchwork Wed Feb 23 05:22:17 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756447 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 30BDAC433F5 for ; Wed, 23 Feb 2022 05:28:07 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238430AbiBWF1v (ORCPT ); Wed, 23 Feb 2022 00:27:51 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58122 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238374AbiBWF1R (ORCPT ); Wed, 23 Feb 2022 00:27:17 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0BFFC6CA68 for ; Tue, 22 Feb 2022 21:25:25 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-2d6baed6aafso144152337b3.3 for ; Tue, 22 Feb 2022 21:25:25 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=fjoegs+Ro+S0LiOfkKDHuccTriIwowqx+uIdOr3iex0=; b=EbHx41T8pBJ5A2BSG8qjfuD65qMmT14wI+rGz7MtVYSyZZTtXAQ9SSi490wWxvTCX3 lF1yqHJ25Ogc1hfXuKo9v+hg5YjvOasU7cMgAwTxB6T3rHmXhsMhhA64NTfNq1j1V9t4 TXGaFy+GztCl/Zph97U1y4UYZaT827j5iNNm3e+F0mQoJhUP0sknSp9qrgRx9gx8cRMf Kb0JwVGCNa+GNMz65PcJow71XH07p87xccnzELq309vUDOD82kYnpx/YluPwYOrivr7t AT7gGVWsZa9CQMSKMpbyZuKYicDY2/ciGXtsoA+y078oLzomCpgRmU1bPsX32QTmpazD udew== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=fjoegs+Ro+S0LiOfkKDHuccTriIwowqx+uIdOr3iex0=; b=vXCQPo6FS6lhhqQYMAbyGGxAteDcJltk8f+r5WvP0cq7Jenixt1EiHWMtb7Gjno93J Tl8W8onOMaFOVKEuD+x6jeTf8motbJqYYmkjpAfoxKG/WaiuCclA8HZ2sFKxwuygEo/J R3x7XdZs9SFe6R/BGC/cZfF/OdCFSMZZhbT3T25WpK6NvsuNH+VMFRTT3y7bK1lgfRwb Q2ezU/TRplj4plMUYl/7tLniIjP0qJpOdyP8YZn7pwnzEN2LEJ0Do0p32+6k/Rmf/00m 7i0jWZl5SDzXSKmSQqj/NOeK/sexmInwTAV6c9IGn4wt8WeUwW62nkZv2QAeU1pvICXu WVMg== X-Gm-Message-State: AOAM532eBg93CGDfCbwRuH1fcoGmxZQbOTfyUYDBY+BH2WJqaP0wPpBC iYXerg9zu8eg0IS/doCSHzxNYvhS0bez X-Google-Smtp-Source: ABdhPJyEpvnRe67yGfdP/oVogGTIG0cZjG1Fj1fU4LBOman77N8SUGlY7TMto2pQXc1VwXiknsbYEeXFLlaJ X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a5b:cc8:0:b0:622:e87:2087 with SMTP id e8-20020a5b0cc8000000b006220e872087mr26256339ybr.106.1645593916871; Tue, 22 Feb 2022 21:25:16 -0800 (PST) Date: Tue, 22 Feb 2022 21:22:17 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-42-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 41/47] mm: asi: Annotation of static variables to be nonsensitive From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: Ofir Weisse , kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, 
tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Ofir Weisse The heart of ASI is to differentiate between sensitive and non-sensitive data access. This commit marks certain static variables as not sensitive. Some static variables are accessed frequently and therefore would cause many ASI exits. The frequency of these accesses is monitored by tracing asi_exits and analyzing the accessed addresses. Many of these variables don't contain sensitive information and can therefore be mapped into the global ASI region. This commit applies the __asi_not_sensitive* attributes to these frequently-accessed yet not sensitive variables. The end result is a very significant reduction in ASI exits on real benchmarks. Signed-off-by: Ofir Weisse --- arch/x86/events/core.c | 4 ++-- arch/x86/events/intel/core.c | 2 +- arch/x86/events/msr.c | 2 +- arch/x86/events/perf_event.h | 2 +- arch/x86/include/asm/kvm_host.h | 4 ++-- arch/x86/kernel/alternative.c | 2 +- arch/x86/kernel/cpu/bugs.c | 2 +- arch/x86/kernel/setup.c | 4 ++-- arch/x86/kernel/smp.c | 2 +- arch/x86/kernel/tsc.c | 8 +++---- arch/x86/kvm/lapic.c | 2 +- arch/x86/kvm/mmu/spte.c | 2 +- arch/x86/kvm/mmu/spte.h | 2 +- arch/x86/kvm/mtrr.c | 2 +- arch/x86/kvm/vmx/capabilities.h | 14 ++++++------ arch/x86/kvm/vmx/vmx.c | 37 ++++++++++++++++--------------- arch/x86/kvm/x86.c | 35 +++++++++++++++-------------- arch/x86/mm/asi.c | 4 ++-- include/linux/debug_locks.h | 4 ++-- include/linux/jiffies.h | 4 ++-- include/linux/notifier.h | 2 +- include/linux/profile.h | 2 +- include/linux/rcupdate.h | 4 +++- include/linux/rcutree.h | 2 +- include/linux/sched/sysctl.h | 1 + init/main.c | 2 +- kernel/cgroup/cgroup.c | 5 +++-- kernel/cpu.c | 14 ++++++------ kernel/events/core.c | 4 ++-- kernel/freezer.c | 2 +- kernel/locking/lockdep.c | 14 ++++++------ kernel/panic.c | 2 +- kernel/printk/printk.c | 4 ++-- kernel/profile.c | 4 ++-- kernel/rcu/tree.c | 10 ++++----- kernel/rcu/update.c | 4 ++-- kernel/sched/clock.c | 2 +- kernel/sched/core.c | 6 ++--- kernel/sched/cpuacct.c | 2 +- kernel/sched/cputime.c | 2 +- kernel/sched/fair.c | 4 ++-- kernel/sched/loadavg.c | 2 +- kernel/sched/rt.c | 2 +- kernel/sched/sched.h | 4 ++-- kernel/smp.c | 2 +- kernel/softirq.c | 3 ++- kernel/time/hrtimer.c | 2 +- kernel/time/jiffies.c | 8 ++++++- kernel/time/ntp.c | 30 ++++++++++++------------- kernel/time/tick-common.c | 4 ++-- kernel/time/tick-internal.h | 2 +- kernel/time/tick-sched.c | 2 +- kernel/time/timekeeping.c | 10 ++++----- kernel/time/timekeeping.h | 2 +- kernel/time/timer.c | 2 +- kernel/trace/trace.c | 2 +- kernel/trace/trace_sched_switch.c | 4 ++-- lib/debug_locks.c | 5 +++-- mm/memory.c | 2 +- mm/page_alloc.c | 2 +- mm/sparse.c | 4 ++-- virt/kvm/kvm_main.c | 2 +- 62 files changed, 170 insertions(+), 156 deletions(-) diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 38b2c779146f..db825bf053fd 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -44,7 +44,7 @@ #include "perf_event.h" -struct x86_pmu x86_pmu __read_mostly; +struct x86_pmu x86_pmu __asi_not_sensitive_readmostly; static struct pmu pmu; DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = { @@ -2685,7 +2685,7 @@ static int x86_pmu_filter_match(struct perf_event *event) return 1; } -static struct pmu pmu = { +static struct pmu pmu __asi_not_sensitive = { .pmu_enable = x86_pmu_enable, .pmu_disable = x86_pmu_disable, diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index
ec6444f2c9dc..5b2b7473b2f2 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -189,7 +189,7 @@ static struct event_constraint intel_slm_event_constraints[] __read_mostly = EVENT_CONSTRAINT_END }; -static struct event_constraint intel_skl_event_constraints[] = { +static struct event_constraint intel_skl_event_constraints[] __asi_not_sensitive = { FIXED_EVENT_CONSTRAINT(0x00c0, 0), /* INST_RETIRED.ANY */ FIXED_EVENT_CONSTRAINT(0x003c, 1), /* CPU_CLK_UNHALTED.CORE */ FIXED_EVENT_CONSTRAINT(0x0300, 2), /* CPU_CLK_UNHALTED.REF */ diff --git a/arch/x86/events/msr.c b/arch/x86/events/msr.c index 96c775abe31f..db7bca37c726 100644 --- a/arch/x86/events/msr.c +++ b/arch/x86/events/msr.c @@ -280,7 +280,7 @@ static int msr_event_add(struct perf_event *event, int flags) return 0; } -static struct pmu pmu_msr = { +static struct pmu pmu_msr __asi_not_sensitive = { .task_ctx_nr = perf_sw_context, .attr_groups = attr_groups, .event_init = msr_event_init, diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index 5480db242083..27cca7fd6f17 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -1020,7 +1020,7 @@ static struct perf_pmu_format_hybrid_attr format_attr_hybrid_##_name = {\ } struct pmu *x86_get_pmu(unsigned int cpu); -extern struct x86_pmu x86_pmu __read_mostly; +extern struct x86_pmu x86_pmu __asi_not_sensitive_readmostly; static __always_inline struct x86_perf_task_context_opt *task_context_opt(void *ctx) { diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 8ba88bbcf895..b7292c4fece7 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1542,8 +1542,8 @@ struct kvm_arch_async_pf { extern u32 __read_mostly kvm_nr_uret_msrs; extern u64 __read_mostly host_efer; -extern bool __read_mostly allow_smaller_maxphyaddr; -extern bool __read_mostly enable_apicv; +extern bool __asi_not_sensitive_readmostly allow_smaller_maxphyaddr; +extern bool __asi_not_sensitive_readmostly enable_apicv; extern struct kvm_x86_ops kvm_x86_ops; #define KVM_X86_OP(func) \ diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 23fb4d51a5da..9836ebe953ed 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -31,7 +31,7 @@ #include #include -int __read_mostly alternatives_patched; +int __asi_not_sensitive alternatives_patched; EXPORT_SYMBOL_GPL(alternatives_patched); diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index 1c1f218a701d..6b5e6574e391 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -46,7 +46,7 @@ static void __init srbds_select_mitigation(void); static void __init l1d_flush_select_mitigation(void); /* The base value of the SPEC_CTRL MSR that always has to be preserved. 
*/ -u64 x86_spec_ctrl_base; +u64 x86_spec_ctrl_base __asi_not_sensitive; EXPORT_SYMBOL_GPL(x86_spec_ctrl_base); static DEFINE_MUTEX(spec_ctrl_mutex); diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index e04f5e6eb33f..d8461ac88b36 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -116,7 +116,7 @@ static struct resource bss_resource = { struct cpuinfo_x86 new_cpu_data; /* Common CPU data for all CPUs */ -struct cpuinfo_x86 boot_cpu_data __read_mostly; +struct cpuinfo_x86 boot_cpu_data __asi_not_sensitive_readmostly; EXPORT_SYMBOL(boot_cpu_data); unsigned int def_to_bigsmp; @@ -133,7 +133,7 @@ struct ist_info ist_info; #endif #else -struct cpuinfo_x86 boot_cpu_data __read_mostly; +struct cpuinfo_x86 boot_cpu_data __asi_not_sensitive_readmostly; EXPORT_SYMBOL(boot_cpu_data); #endif diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c index 06db901fabe8..e9e10ffc2ec2 100644 --- a/arch/x86/kernel/smp.c +++ b/arch/x86/kernel/smp.c @@ -257,7 +257,7 @@ static int __init nonmi_ipi_setup(char *str) __setup("nonmi_ipi", nonmi_ipi_setup); -struct smp_ops smp_ops = { +struct smp_ops smp_ops __asi_not_sensitive = { .smp_prepare_boot_cpu = native_smp_prepare_boot_cpu, .smp_prepare_cpus = native_smp_prepare_cpus, .smp_cpus_done = native_smp_cpus_done, diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index a698196377be..d7169da99b01 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -30,10 +30,10 @@ #include #include -unsigned int __read_mostly cpu_khz; /* TSC clocks / usec, not used here */ +unsigned int __asi_not_sensitive_readmostly cpu_khz; /* TSC clocks / usec, not used here */ EXPORT_SYMBOL(cpu_khz); -unsigned int __read_mostly tsc_khz; +unsigned int __asi_not_sensitive_readmostly tsc_khz; EXPORT_SYMBOL(tsc_khz); #define KHZ 1000 @@ -41,7 +41,7 @@ EXPORT_SYMBOL(tsc_khz); /* * TSC can be unstable due to cpufreq or due to unsynced TSCs */ -static int __read_mostly tsc_unstable; +static int __asi_not_sensitive_readmostly tsc_unstable; static unsigned int __initdata tsc_early_khz; static DEFINE_STATIC_KEY_FALSE(__use_tsc); @@ -1146,7 +1146,7 @@ static struct clocksource clocksource_tsc_early = { * this one will immediately take over. We will only register if TSC has * been found good. 
*/ -static struct clocksource clocksource_tsc = { +static struct clocksource clocksource_tsc __asi_not_sensitive = { .name = "tsc", .rating = 300, .read = read_tsc, diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index f206fc35deff..213bbdfab49e 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -60,7 +60,7 @@ #define MAX_APIC_VECTOR 256 #define APIC_VECTORS_PER_REG 32 -static bool lapic_timer_advance_dynamic __read_mostly; +static bool lapic_timer_advance_dynamic __asi_not_sensitive_readmostly; #define LAPIC_TIMER_ADVANCE_ADJUST_MIN 100 /* clock cycles */ #define LAPIC_TIMER_ADVANCE_ADJUST_MAX 10000 /* clock cycles */ #define LAPIC_TIMER_ADVANCE_NS_INIT 1000 diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c index 0c76c45fdb68..13038fae5088 100644 --- a/arch/x86/kvm/mmu/spte.c +++ b/arch/x86/kvm/mmu/spte.c @@ -33,7 +33,7 @@ u64 __read_mostly shadow_mmio_mask; u64 __read_mostly shadow_mmio_access_mask; u64 __read_mostly shadow_present_mask; u64 __read_mostly shadow_me_mask; -u64 __read_mostly shadow_acc_track_mask; +u64 __asi_not_sensitive_readmostly shadow_acc_track_mask; u64 __read_mostly shadow_nonpresent_or_rsvd_mask; u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask; diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h index cc432f9a966b..d1af03f63009 100644 --- a/arch/x86/kvm/mmu/spte.h +++ b/arch/x86/kvm/mmu/spte.h @@ -151,7 +151,7 @@ extern u64 __read_mostly shadow_me_mask; * shadow_acc_track_mask is the set of bits to be cleared in non-accessed * pages. */ -extern u64 __read_mostly shadow_acc_track_mask; +extern u64 __asi_not_sensitive_readmostly shadow_acc_track_mask; /* * This mask must be set on all non-zero Non-Present or Reserved SPTEs in order diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c index a8502e02f479..66228abfa9fa 100644 --- a/arch/x86/kvm/mtrr.c +++ b/arch/x86/kvm/mtrr.c @@ -138,7 +138,7 @@ struct fixed_mtrr_segment { int range_start; }; -static struct fixed_mtrr_segment fixed_seg_table[] = { +static struct fixed_mtrr_segment fixed_seg_table[] __asi_not_sensitive = { /* MSR_MTRRfix64K_00000, 1 unit. 64K fixed mtrr. 
*/ { .start = 0x0, diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h index 4705ad55abb5..0ab03ec7d6d0 100644 --- a/arch/x86/kvm/vmx/capabilities.h +++ b/arch/x86/kvm/vmx/capabilities.h @@ -6,13 +6,13 @@ #include "lapic.h" -extern bool __read_mostly enable_vpid; -extern bool __read_mostly flexpriority_enabled; -extern bool __read_mostly enable_ept; -extern bool __read_mostly enable_unrestricted_guest; -extern bool __read_mostly enable_ept_ad_bits; -extern bool __read_mostly enable_pml; -extern int __read_mostly pt_mode; +extern bool __asi_not_sensitive_readmostly enable_vpid; +extern bool __asi_not_sensitive_readmostly flexpriority_enabled; +extern bool __asi_not_sensitive_readmostly enable_ept; +extern bool __asi_not_sensitive_readmostly enable_unrestricted_guest; +extern bool __asi_not_sensitive_readmostly enable_ept_ad_bits; +extern bool __asi_not_sensitive_readmostly enable_pml; +extern int __asi_not_sensitive_readmostly pt_mode; #define PT_MODE_SYSTEM 0 #define PT_MODE_HOST_GUEST 1 diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 6549fef39f2b..e1ad82c25a78 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -78,29 +78,29 @@ static const struct x86_cpu_id vmx_cpu_id[] = { MODULE_DEVICE_TABLE(x86cpu, vmx_cpu_id); #endif -bool __read_mostly enable_vpid = 1; +bool __asi_not_sensitive_readmostly enable_vpid = 1; module_param_named(vpid, enable_vpid, bool, 0444); -static bool __read_mostly enable_vnmi = 1; +static bool __asi_not_sensitive_readmostly enable_vnmi = 1; module_param_named(vnmi, enable_vnmi, bool, S_IRUGO); -bool __read_mostly flexpriority_enabled = 1; +bool __asi_not_sensitive_readmostly flexpriority_enabled = 1; module_param_named(flexpriority, flexpriority_enabled, bool, S_IRUGO); -bool __read_mostly enable_ept = 1; +bool __asi_not_sensitive_readmostly enable_ept = 1; module_param_named(ept, enable_ept, bool, S_IRUGO); -bool __read_mostly enable_unrestricted_guest = 1; +bool __asi_not_sensitive_readmostly enable_unrestricted_guest = 1; module_param_named(unrestricted_guest, enable_unrestricted_guest, bool, S_IRUGO); -bool __read_mostly enable_ept_ad_bits = 1; +bool __asi_not_sensitive_readmostly enable_ept_ad_bits = 1; module_param_named(eptad, enable_ept_ad_bits, bool, S_IRUGO); -static bool __read_mostly emulate_invalid_guest_state = true; +static bool __asi_not_sensitive_readmostly emulate_invalid_guest_state = true; module_param(emulate_invalid_guest_state, bool, S_IRUGO); -static bool __read_mostly fasteoi = 1; +static bool __asi_not_sensitive_readmostly fasteoi = 1; module_param(fasteoi, bool, S_IRUGO); module_param(enable_apicv, bool, S_IRUGO); @@ -110,13 +110,13 @@ module_param(enable_apicv, bool, S_IRUGO); * VMX and be a hypervisor for its own guests. If nested=0, guests may not * use VMX instructions. */ -static bool __read_mostly nested = 1; +static bool __asi_not_sensitive_readmostly nested = 1; module_param(nested, bool, S_IRUGO); -bool __read_mostly enable_pml = 1; +bool __asi_not_sensitive_readmostly enable_pml = 1; module_param_named(pml, enable_pml, bool, S_IRUGO); -static bool __read_mostly dump_invalid_vmcs = 0; +static bool __asi_not_sensitive_readmostly dump_invalid_vmcs = 0; module_param(dump_invalid_vmcs, bool, 0644); #define MSR_BITMAP_MODE_X2APIC 1 @@ -125,13 +125,13 @@ module_param(dump_invalid_vmcs, bool, 0644); #define KVM_VMX_TSC_MULTIPLIER_MAX 0xffffffffffffffffULL /* Guest_tsc -> host_tsc conversion requires 64-bit division. 
*/ -static int __read_mostly cpu_preemption_timer_multi; -static bool __read_mostly enable_preemption_timer = 1; +static int __asi_not_sensitive_readmostly cpu_preemption_timer_multi; +static bool __asi_not_sensitive_readmostly enable_preemption_timer = 1; #ifdef CONFIG_X86_64 module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO); #endif -extern bool __read_mostly allow_smaller_maxphyaddr; +extern bool __asi_not_sensitive_readmostly allow_smaller_maxphyaddr; module_param(allow_smaller_maxphyaddr, bool, S_IRUGO); #define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD) @@ -202,7 +202,7 @@ static unsigned int ple_window_max = KVM_VMX_DEFAULT_PLE_WINDOW_MAX; module_param(ple_window_max, uint, 0444); /* Default is SYSTEM mode, 1 for host-guest mode */ -int __read_mostly pt_mode = PT_MODE_SYSTEM; +int __asi_not_sensitive_readmostly pt_mode = PT_MODE_SYSTEM; module_param(pt_mode, int, S_IRUGO); static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush); @@ -421,7 +421,7 @@ static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu); static DECLARE_BITMAP(vmx_vpid_bitmap, VMX_NR_VPIDS); static DEFINE_SPINLOCK(vmx_vpid_lock); -struct vmcs_config vmcs_config; +struct vmcs_config vmcs_config __asi_not_sensitive; struct vmx_capability vmx_capability; #define VMX_SEGMENT_FIELD(seg) \ @@ -453,7 +453,7 @@ static inline void vmx_segment_cache_clear(struct vcpu_vmx *vmx) vmx->segment_cache.bitmask = 0; } -static unsigned long host_idt_base; +static unsigned long host_idt_base __asi_not_sensitive; #if IS_ENABLED(CONFIG_HYPERV) static bool __read_mostly enlightened_vmcs = true; @@ -5549,7 +5549,8 @@ static int handle_bus_lock_vmexit(struct kvm_vcpu *vcpu) * may resume. Otherwise they set the kvm_run parameter to indicate what needs * to be done to userspace and return 0. 
*/ -static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = { +static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) __asi_not_sensitive += { [EXIT_REASON_EXCEPTION_NMI] = handle_exception_nmi, [EXIT_REASON_EXTERNAL_INTERRUPT] = handle_external_interrupt, [EXIT_REASON_TRIPLE_FAULT] = handle_triple_fault, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index d0df14deae80..0df88eadab60 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -123,7 +123,7 @@ static int sync_regs(struct kvm_vcpu *vcpu); static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2); static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2); -struct kvm_x86_ops kvm_x86_ops __read_mostly; +struct kvm_x86_ops kvm_x86_ops __asi_not_sensitive_readmostly; EXPORT_SYMBOL_GPL(kvm_x86_ops); #define KVM_X86_OP(func) \ @@ -148,17 +148,17 @@ module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR); static bool __read_mostly kvmclock_periodic_sync = true; module_param(kvmclock_periodic_sync, bool, S_IRUGO); -bool __read_mostly kvm_has_tsc_control; +bool __asi_not_sensitive_readmostly kvm_has_tsc_control; EXPORT_SYMBOL_GPL(kvm_has_tsc_control); -u32 __read_mostly kvm_max_guest_tsc_khz; +u32 __asi_not_sensitive_readmostly kvm_max_guest_tsc_khz; EXPORT_SYMBOL_GPL(kvm_max_guest_tsc_khz); -u8 __read_mostly kvm_tsc_scaling_ratio_frac_bits; +u8 __asi_not_sensitive_readmostly kvm_tsc_scaling_ratio_frac_bits; EXPORT_SYMBOL_GPL(kvm_tsc_scaling_ratio_frac_bits); -u64 __read_mostly kvm_max_tsc_scaling_ratio; +u64 __asi_not_sensitive_readmostly kvm_max_tsc_scaling_ratio; EXPORT_SYMBOL_GPL(kvm_max_tsc_scaling_ratio); -u64 __read_mostly kvm_default_tsc_scaling_ratio; +u64 __asi_not_sensitive_readmostly kvm_default_tsc_scaling_ratio; EXPORT_SYMBOL_GPL(kvm_default_tsc_scaling_ratio); -bool __read_mostly kvm_has_bus_lock_exit; +bool __asi_not_sensitive_readmostly kvm_has_bus_lock_exit; EXPORT_SYMBOL_GPL(kvm_has_bus_lock_exit); /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */ @@ -171,20 +171,20 @@ module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR); * advancement entirely. Any other value is used as-is and disables adaptive * tuning, i.e. allows privileged userspace to set an exact advancement time. 
*/ -static int __read_mostly lapic_timer_advance_ns = -1; +static int __asi_not_sensitive_readmostly lapic_timer_advance_ns = -1; module_param(lapic_timer_advance_ns, int, S_IRUGO | S_IWUSR); -static bool __read_mostly vector_hashing = true; +static bool __asi_not_sensitive_readmostly vector_hashing = true; module_param(vector_hashing, bool, S_IRUGO); -bool __read_mostly enable_vmware_backdoor = false; +bool __asi_not_sensitive_readmostly enable_vmware_backdoor = false; module_param(enable_vmware_backdoor, bool, S_IRUGO); EXPORT_SYMBOL_GPL(enable_vmware_backdoor); -static bool __read_mostly force_emulation_prefix = false; +static bool __asi_not_sensitive_readmostly force_emulation_prefix = false; module_param(force_emulation_prefix, bool, S_IRUGO); -int __read_mostly pi_inject_timer = -1; +int __asi_not_sensitive_readmostly pi_inject_timer = -1; module_param(pi_inject_timer, bint, S_IRUGO | S_IWUSR); /* @@ -216,13 +216,14 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs; u64 __read_mostly host_efer; EXPORT_SYMBOL_GPL(host_efer); -bool __read_mostly allow_smaller_maxphyaddr = 0; +bool __asi_not_sensitive_readmostly allow_smaller_maxphyaddr = 0; EXPORT_SYMBOL_GPL(allow_smaller_maxphyaddr); -bool __read_mostly enable_apicv = true; +bool __asi_not_sensitive_readmostly enable_apicv = true; EXPORT_SYMBOL_GPL(enable_apicv); -u64 __read_mostly host_xss; +/* TODO(oweisse): how dangerous is this variable, from a security standpoint? */ +u64 __asi_not_sensitive_readmostly host_xss; EXPORT_SYMBOL_GPL(host_xss); u64 __read_mostly supported_xss; EXPORT_SYMBOL_GPL(supported_xss); @@ -292,7 +293,7 @@ const struct kvm_stats_header kvm_vcpu_stats_header = { sizeof(kvm_vcpu_stats_desc), }; -u64 __read_mostly host_xcr0; +u64 __asi_not_sensitive_readmostly host_xcr0; u64 __read_mostly supported_xcr0; EXPORT_SYMBOL_GPL(supported_xcr0); @@ -2077,7 +2078,7 @@ struct pvclock_gtod_data { u64 wall_time_sec; }; -static struct pvclock_gtod_data pvclock_gtod_data; +static struct pvclock_gtod_data pvclock_gtod_data __asi_not_sensitive; static void update_pvclock_gtod(struct timekeeper *tk) { diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index ba373b461855..fdc117929fc7 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -17,8 +17,8 @@ #undef pr_fmt #define pr_fmt(fmt) "ASI: " fmt -static struct asi_class asi_class[ASI_MAX_NUM]; -static DEFINE_SPINLOCK(asi_class_lock); +static struct asi_class asi_class[ASI_MAX_NUM] __asi_not_sensitive; +static DEFINE_SPINLOCK(asi_class_lock __asi_not_sensitive); DEFINE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state); EXPORT_PER_CPU_SYMBOL_GPL(asi_cpu_state); diff --git a/include/linux/debug_locks.h b/include/linux/debug_locks.h index dbb409d77d4f..7bd0c3dd6d47 100644 --- a/include/linux/debug_locks.h +++ b/include/linux/debug_locks.h @@ -7,8 +7,8 @@ struct task_struct; -extern int debug_locks __read_mostly; -extern int debug_locks_silent __read_mostly; +extern int debug_locks; +extern int debug_locks_silent; static __always_inline int __debug_locks_off(void) diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h index 5e13f801c902..deccab0dcb4a 100644 --- a/include/linux/jiffies.h +++ b/include/linux/jiffies.h @@ -76,8 +76,8 @@ extern int register_refined_jiffies(long clock_tick_rate); * without sampling the sequence number in jiffies_lock. * get_jiffies_64() will do this for you as appropriate. 
*/ -extern u64 __cacheline_aligned_in_smp jiffies_64; -extern unsigned long volatile __cacheline_aligned_in_smp __jiffy_arch_data jiffies; +extern u64 jiffies_64; +extern unsigned long volatile __jiffy_arch_data jiffies; #if (BITS_PER_LONG < 64) u64 get_jiffies_64(void); diff --git a/include/linux/notifier.h b/include/linux/notifier.h index 87069b8459af..a27b193b8e60 100644 --- a/include/linux/notifier.h +++ b/include/linux/notifier.h @@ -117,7 +117,7 @@ extern void srcu_init_notifier_head(struct srcu_notifier_head *nh); struct blocking_notifier_head name = \ BLOCKING_NOTIFIER_INIT(name) #define RAW_NOTIFIER_HEAD(name) \ - struct raw_notifier_head name = \ + struct raw_notifier_head name __asi_not_sensitive = \ RAW_NOTIFIER_INIT(name) #ifdef CONFIG_TREE_SRCU diff --git a/include/linux/profile.h b/include/linux/profile.h index fd18ca96f557..4988b6d05d4c 100644 --- a/include/linux/profile.h +++ b/include/linux/profile.h @@ -38,7 +38,7 @@ enum profile_type { #ifdef CONFIG_PROFILING -extern int prof_on __read_mostly; +extern int prof_on; /* init basic kernel profiler */ int profile_init(void); diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index 5e0beb5c5659..34f5073c88a2 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -84,7 +84,7 @@ static inline int rcu_preempt_depth(void) /* Internal to kernel */ void rcu_init(void); -extern int rcu_scheduler_active __read_mostly; +extern int rcu_scheduler_active; void rcu_sched_clock_irq(int user); void rcu_report_dead(unsigned int cpu); void rcutree_migrate_callbacks(int cpu); @@ -308,6 +308,8 @@ static inline int rcu_read_lock_any_held(void) #ifdef CONFIG_PROVE_RCU +/* TODO: ASI - (oweisse) we might want to switch ".data.unlikely" to some other + * section that will be mapped to ASI. */ /** * RCU_LOCKDEP_WARN - emit lockdep splat if specified condition is met * @c: condition to check diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h index 53209d669400..76665db179fa 100644 --- a/include/linux/rcutree.h +++ b/include/linux/rcutree.h @@ -62,7 +62,7 @@ static inline void rcu_irq_exit_check_preempt(void) { } void exit_rcu(void); void rcu_scheduler_starting(void); -extern int rcu_scheduler_active __read_mostly; +extern int rcu_scheduler_active; void rcu_end_inkernel_boot(void); bool rcu_inkernel_boot_has_ended(void); bool rcu_is_watching(void); diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 304f431178fd..1529e3835939 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -3,6 +3,7 @@ #define _LINUX_SCHED_SYSCTL_H #include +#include struct ctl_table; diff --git a/init/main.c b/init/main.c index bb984ed79de0..ce87fac83aed 100644 --- a/init/main.c +++ b/init/main.c @@ -123,7 +123,7 @@ extern void radix_tree_init(void); * operations which are not allowed with IRQ disabled are allowed while the * flag is set. 
*/ -bool early_boot_irqs_disabled __read_mostly; +bool early_boot_irqs_disabled __asi_not_sensitive; enum system_states system_state __read_mostly; EXPORT_SYMBOL(system_state); diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index cafb8c114a21..729495e17363 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -162,7 +162,8 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = { static DEFINE_PER_CPU(struct cgroup_rstat_cpu, cgrp_dfl_root_rstat_cpu); /* the default hierarchy */ -struct cgroup_root cgrp_dfl_root = { .cgrp.rstat_cpu = &cgrp_dfl_root_rstat_cpu }; +struct cgroup_root cgrp_dfl_root __asi_not_sensitive = + { .cgrp.rstat_cpu = &cgrp_dfl_root_rstat_cpu }; EXPORT_SYMBOL_GPL(cgrp_dfl_root); /* @@ -755,7 +756,7 @@ EXPORT_SYMBOL_GPL(of_css); * reference-counted, to improve performance when child cgroups * haven't been created. */ -struct css_set init_css_set = { +struct css_set init_css_set __asi_not_sensitive = { .refcount = REFCOUNT_INIT(1), .dom_cset = &init_css_set, .tasks = LIST_HEAD_INIT(init_css_set.tasks), diff --git a/kernel/cpu.c b/kernel/cpu.c index 407a2568f35e..59530bd5da39 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -2581,26 +2581,26 @@ const DECLARE_BITMAP(cpu_all_bits, NR_CPUS) = CPU_BITS_ALL; EXPORT_SYMBOL(cpu_all_bits); #ifdef CONFIG_INIT_ALL_POSSIBLE -struct cpumask __cpu_possible_mask __read_mostly +struct cpumask __cpu_possible_mask __asi_not_sensitive_readmostly = {CPU_BITS_ALL}; #else -struct cpumask __cpu_possible_mask __read_mostly; +struct cpumask __cpu_possible_mask __asi_not_sensitive_readmostly; #endif EXPORT_SYMBOL(__cpu_possible_mask); -struct cpumask __cpu_online_mask __read_mostly; +struct cpumask __cpu_online_mask __asi_not_sensitive_readmostly; EXPORT_SYMBOL(__cpu_online_mask); -struct cpumask __cpu_present_mask __read_mostly; +struct cpumask __cpu_present_mask __asi_not_sensitive_readmostly; EXPORT_SYMBOL(__cpu_present_mask); -struct cpumask __cpu_active_mask __read_mostly; +struct cpumask __cpu_active_mask __asi_not_sensitive_readmostly; EXPORT_SYMBOL(__cpu_active_mask); -struct cpumask __cpu_dying_mask __read_mostly; +struct cpumask __cpu_dying_mask __asi_not_sensitive_readmostly; EXPORT_SYMBOL(__cpu_dying_mask); -atomic_t __num_online_cpus __read_mostly; +atomic_t __num_online_cpus __asi_not_sensitive_readmostly; EXPORT_SYMBOL(__num_online_cpus); void init_cpu_present(const struct cpumask *src) diff --git a/kernel/events/core.c b/kernel/events/core.c index 30d94f68c5bd..6ea559b6e0f4 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -9651,7 +9651,7 @@ static int perf_swevent_init(struct perf_event *event) return 0; } -static struct pmu perf_swevent = { +static struct pmu perf_swevent __asi_not_sensitive = { .task_ctx_nr = perf_sw_context, .capabilities = PERF_PMU_CAP_NO_NMI, @@ -9800,7 +9800,7 @@ static int perf_tp_event_init(struct perf_event *event) return 0; } -static struct pmu perf_tracepoint = { +static struct pmu perf_tracepoint __asi_not_sensitive = { .task_ctx_nr = perf_sw_context, .event_init = perf_tp_event_init, diff --git a/kernel/freezer.c b/kernel/freezer.c index 45ab36ffd0e7..6ca163e4880b 100644 --- a/kernel/freezer.c +++ b/kernel/freezer.c @@ -13,7 +13,7 @@ #include /* total number of freezing conditions in effect */ -atomic_t system_freezing_cnt = ATOMIC_INIT(0); +atomic_t __asi_not_sensitive system_freezing_cnt = ATOMIC_INIT(0); EXPORT_SYMBOL(system_freezing_cnt); /* indicate whether PM freezing is in effect, protected by diff --git a/kernel/locking/lockdep.c 
b/kernel/locking/lockdep.c index 2270ec68f10a..1b8f51a37883 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -64,7 +64,7 @@ #include #ifdef CONFIG_PROVE_LOCKING -int prove_locking = 1; +int prove_locking __asi_not_sensitive = 1; module_param(prove_locking, int, 0644); #else #define prove_locking 0 @@ -186,8 +186,8 @@ unsigned long nr_zapped_classes; #ifndef CONFIG_DEBUG_LOCKDEP static #endif -struct lock_class lock_classes[MAX_LOCKDEP_KEYS]; -static DECLARE_BITMAP(lock_classes_in_use, MAX_LOCKDEP_KEYS); +struct lock_class lock_classes[MAX_LOCKDEP_KEYS] __asi_not_sensitive; +static DECLARE_BITMAP(lock_classes_in_use, MAX_LOCKDEP_KEYS) __asi_not_sensitive; static inline struct lock_class *hlock_class(struct held_lock *hlock) { @@ -389,7 +389,7 @@ static struct hlist_head classhash_table[CLASSHASH_SIZE]; #define __chainhashfn(chain) hash_long(chain, CHAINHASH_BITS) #define chainhashentry(chain) (chainhash_table + __chainhashfn((chain))) -static struct hlist_head chainhash_table[CHAINHASH_SIZE]; +static struct hlist_head chainhash_table[CHAINHASH_SIZE] __asi_not_sensitive; /* * the id of held_lock @@ -599,7 +599,7 @@ u64 lockdep_stack_hash_count(void) unsigned int nr_hardirq_chains; unsigned int nr_softirq_chains; unsigned int nr_process_chains; -unsigned int max_lockdep_depth; +unsigned int max_lockdep_depth __asi_not_sensitive; #ifdef CONFIG_DEBUG_LOCKDEP /* @@ -3225,8 +3225,8 @@ check_prevs_add(struct task_struct *curr, struct held_lock *next) return 0; } -struct lock_chain lock_chains[MAX_LOCKDEP_CHAINS]; -static DECLARE_BITMAP(lock_chains_in_use, MAX_LOCKDEP_CHAINS); +struct lock_chain lock_chains[MAX_LOCKDEP_CHAINS] __asi_not_sensitive; +static DECLARE_BITMAP(lock_chains_in_use, MAX_LOCKDEP_CHAINS) __asi_not_sensitive; static u16 chain_hlocks[MAX_LOCKDEP_CHAIN_HLOCKS]; unsigned long nr_zapped_lock_chains; unsigned int nr_free_chain_hlocks; /* Free chain_hlocks in buckets */ diff --git a/kernel/panic.c b/kernel/panic.c index cefd7d82366f..6d0ee3ddd58b 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -56,7 +56,7 @@ int panic_on_warn __read_mostly; unsigned long panic_on_taint; bool panic_on_taint_nousertaint = false; -int panic_timeout = CONFIG_PANIC_TIMEOUT; +int panic_timeout __asi_not_sensitive = CONFIG_PANIC_TIMEOUT; EXPORT_SYMBOL_GPL(panic_timeout); #define PANIC_PRINT_TASK_INFO 0x00000001 diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 57b132b658e1..3425fb1554d3 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -75,7 +75,7 @@ EXPORT_SYMBOL(ignore_console_lock_warning); * Low level drivers may need that to know if they can schedule in * their unblank() callback or not. So let's export it. 
*/ -int oops_in_progress; +int oops_in_progress __asi_not_sensitive; EXPORT_SYMBOL(oops_in_progress); /* @@ -2001,7 +2001,7 @@ static u8 *__printk_recursion_counter(void) local_irq_restore(flags); \ } while (0) -int printk_delay_msec __read_mostly; +int printk_delay_msec __asi_not_sensitive_readmostly; static inline void printk_delay(void) { diff --git a/kernel/profile.c b/kernel/profile.c index eb9c7f0f5ac5..c5beb9b0b0a8 100644 --- a/kernel/profile.c +++ b/kernel/profile.c @@ -44,10 +44,10 @@ static atomic_t *prof_buffer; static unsigned long prof_len; static unsigned short int prof_shift; -int prof_on __read_mostly; +int prof_on __asi_not_sensitive_readmostly; EXPORT_SYMBOL_GPL(prof_on); -static cpumask_var_t prof_cpu_mask; +static cpumask_var_t prof_cpu_mask __asi_not_sensitive; #if defined(CONFIG_SMP) && defined(CONFIG_PROC_FS) static DEFINE_PER_CPU(struct profile_hit *[2], cpu_profile_hits); static DEFINE_PER_CPU(int, cpu_profile_flip); diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index ef8d36f580fc..284d2722cf0c 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -82,7 +82,7 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = { .cblist.flags = SEGCBLIST_SOFTIRQ_ONLY, #endif }; -static struct rcu_state rcu_state = { +static struct rcu_state rcu_state __asi_not_sensitive = { .level = { &rcu_state.node[0] }, .gp_state = RCU_GP_IDLE, .gp_seq = (0UL - 300UL) << RCU_SEQ_CTR_SHIFT, @@ -98,7 +98,7 @@ static struct rcu_state rcu_state = { static bool dump_tree; module_param(dump_tree, bool, 0444); /* By default, use RCU_SOFTIRQ instead of rcuc kthreads. */ -static bool use_softirq = !IS_ENABLED(CONFIG_PREEMPT_RT); +static __asi_not_sensitive bool use_softirq = !IS_ENABLED(CONFIG_PREEMPT_RT); #ifndef CONFIG_PREEMPT_RT module_param(use_softirq, bool, 0444); #endif @@ -125,7 +125,7 @@ int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */ * transitions from RCU_SCHEDULER_INIT to RCU_SCHEDULER_RUNNING after RCU * is fully initialized, including all of its kthreads having been spawned. */ -int rcu_scheduler_active __read_mostly; +int rcu_scheduler_active __asi_not_sensitive; EXPORT_SYMBOL_GPL(rcu_scheduler_active); /* @@ -140,7 +140,7 @@ EXPORT_SYMBOL_GPL(rcu_scheduler_active); * early boot to take responsibility for these callbacks, but one step at * a time. */ -static int rcu_scheduler_fully_active __read_mostly; +static int rcu_scheduler_fully_active __asi_not_sensitive; static void rcu_report_qs_rnp(unsigned long mask, struct rcu_node *rnp, unsigned long gps, unsigned long flags); @@ -470,7 +470,7 @@ module_param(qovld, long, 0444); static ulong jiffies_till_first_fqs = IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) ? 
0 : ULONG_MAX; static ulong jiffies_till_next_fqs = ULONG_MAX; -static bool rcu_kick_kthreads; +static bool rcu_kick_kthreads __asi_not_sensitive; static int rcu_divisor = 7; module_param(rcu_divisor, int, 0644); diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index 156892c22bb5..b61a3854e62d 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -243,7 +243,7 @@ core_initcall(rcu_set_runtime_mode); #ifdef CONFIG_DEBUG_LOCK_ALLOC static struct lock_class_key rcu_lock_key; -struct lockdep_map rcu_lock_map = { +struct lockdep_map rcu_lock_map __asi_not_sensitive = { .name = "rcu_read_lock", .key = &rcu_lock_key, .wait_type_outer = LD_WAIT_FREE, @@ -494,7 +494,7 @@ EXPORT_SYMBOL_GPL(rcutorture_sched_setaffinity); #ifdef CONFIG_RCU_STALL_COMMON int rcu_cpu_stall_ftrace_dump __read_mostly; module_param(rcu_cpu_stall_ftrace_dump, int, 0644); -int rcu_cpu_stall_suppress __read_mostly; // !0 = suppress stall warnings. +int rcu_cpu_stall_suppress __asi_not_sensitive_readmostly; // !0 = suppress stall warnings. EXPORT_SYMBOL_GPL(rcu_cpu_stall_suppress); module_param(rcu_cpu_stall_suppress, int, 0644); int rcu_cpu_stall_timeout __read_mostly = CONFIG_RCU_CPU_STALL_TIMEOUT; diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c index c2b2859ddd82..6c3585053f05 100644 --- a/kernel/sched/clock.c +++ b/kernel/sched/clock.c @@ -84,7 +84,7 @@ static int __sched_clock_stable_early = 1; /* * We want: ktime_get_ns() + __gtod_offset == sched_clock() + __sched_clock_offset */ -__read_mostly u64 __sched_clock_offset; +__asi_not_sensitive u64 __sched_clock_offset; static __read_mostly u64 __gtod_offset; struct sched_clock_data { diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 44ea197c16ea..e1c08ff4130e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -76,9 +76,9 @@ __read_mostly int sysctl_resched_latency_warn_once = 1; * Limited because this is done with IRQs disabled. */ #ifdef CONFIG_PREEMPT_RT -const_debug unsigned int sysctl_sched_nr_migrate = 8; +unsigned int sysctl_sched_nr_migrate __asi_not_sensitive_readmostly = 8; #else -const_debug unsigned int sysctl_sched_nr_migrate = 32; +unsigned int sysctl_sched_nr_migrate __asi_not_sensitive_readmostly = 32; #endif /* @@ -9254,7 +9254,7 @@ int in_sched_functions(unsigned long addr) * Default task group. * Every task in system belongs to this group at bootup. 
*/ -struct task_group root_task_group; +struct task_group root_task_group __asi_not_sensitive; LIST_HEAD(task_groups); /* Cacheline aligned slab cache for task_group */ diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c index 893eece65bfd..6e3da149125c 100644 --- a/kernel/sched/cpuacct.c +++ b/kernel/sched/cpuacct.c @@ -50,7 +50,7 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca) } static DEFINE_PER_CPU(struct cpuacct_usage, root_cpuacct_cpuusage); -static struct cpuacct root_cpuacct = { +static struct cpuacct root_cpuacct __asi_not_sensitive = { .cpustat = &kernel_cpustat, .cpuusage = &root_cpuacct_cpuusage, }; diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 9392aea1804e..623b5feb142a 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -19,7 +19,7 @@ */ DEFINE_PER_CPU(struct irqtime, cpu_irqtime); -static int sched_clock_irqtime; +static int __asi_not_sensitive sched_clock_irqtime; void enable_sched_clock_irqtime(void) { diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6e476f6d9435..dc9b6133b059 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -35,7 +35,7 @@ * * (default: 6ms * (1 + ilog(ncpus)), units: nanoseconds) */ -unsigned int sysctl_sched_latency = 6000000ULL; +__asi_not_sensitive unsigned int sysctl_sched_latency = 6000000ULL; static unsigned int normalized_sysctl_sched_latency = 6000000ULL; /* @@ -90,7 +90,7 @@ unsigned int sysctl_sched_child_runs_first __read_mostly; unsigned int sysctl_sched_wakeup_granularity = 1000000UL; static unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL; -const_debug unsigned int sysctl_sched_migration_cost = 500000UL; +unsigned int sysctl_sched_migration_cost __asi_not_sensitive_readmostly = 500000UL; int sched_thermal_decay_shift; static int __init setup_sched_thermal_decay_shift(char *str) diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c index 954b229868d9..af71cde93e98 100644 --- a/kernel/sched/loadavg.c +++ b/kernel/sched/loadavg.c @@ -57,7 +57,7 @@ /* Variables and functions for calc_load */ atomic_long_t calc_load_tasks; -unsigned long calc_load_update; +unsigned long calc_load_update __asi_not_sensitive; unsigned long avenrun[3]; EXPORT_SYMBOL(avenrun); /* should be removed */ diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index b48baaba2fc2..9d5fbe66d355 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -14,7 +14,7 @@ static const u64 max_rt_runtime = MAX_BW; static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun); -struct rt_bandwidth def_rt_bandwidth; +struct rt_bandwidth def_rt_bandwidth __asi_not_sensitive; static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer) { diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 0e66749486e7..517c70a29a57 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2379,8 +2379,8 @@ extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags); extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags); -extern const_debug unsigned int sysctl_sched_nr_migrate; -extern const_debug unsigned int sysctl_sched_migration_cost; +extern unsigned int sysctl_sched_nr_migrate; +extern unsigned int sysctl_sched_migration_cost; #ifdef CONFIG_SCHED_DEBUG extern unsigned int sysctl_sched_latency; diff --git a/kernel/smp.c b/kernel/smp.c index 01a7c1706a58..c51fd981a4a9 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -1070,7 +1070,7 @@ static int __init maxcpus(char *str) early_param("maxcpus", maxcpus); /* 
Setup number of possible processor ids */ -unsigned int nr_cpu_ids __read_mostly = NR_CPUS; +unsigned int nr_cpu_ids __asi_not_sensitive = NR_CPUS; EXPORT_SYMBOL(nr_cpu_ids); /* An arch may set nr_cpu_ids earlier if needed, so this would be redundant */ diff --git a/kernel/softirq.c b/kernel/softirq.c index 41f470929e99..c462b7fab4d3 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -56,7 +56,8 @@ DEFINE_PER_CPU_ALIGNED(irq_cpustat_t, irq_stat); EXPORT_PER_CPU_SYMBOL(irq_stat); #endif -static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp; +static struct softirq_action softirq_vec[NR_SOFTIRQS] +__asi_not_sensitive ____cacheline_aligned; DEFINE_PER_CPU(struct task_struct *, ksoftirqd); diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index 0ea8702eb516..8b176f5c01f2 100644 --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -706,7 +706,7 @@ hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal) * High resolution timer enabled ? */ static bool hrtimer_hres_enabled __read_mostly = true; -unsigned int hrtimer_resolution __read_mostly = LOW_RES_NSEC; +unsigned int hrtimer_resolution __asi_not_sensitive = LOW_RES_NSEC; EXPORT_SYMBOL_GPL(hrtimer_resolution); /* diff --git a/kernel/time/jiffies.c b/kernel/time/jiffies.c index bc4db9e5ab70..c60f8da1cfb5 100644 --- a/kernel/time/jiffies.c +++ b/kernel/time/jiffies.c @@ -40,7 +40,13 @@ static struct clocksource clocksource_jiffies = { .max_cycles = 10, }; -__cacheline_aligned_in_smp DEFINE_RAW_SPINLOCK(jiffies_lock); +/* TODO(oweisse): __cacheline_aligned_in_smp is expanded to + __section__(".data..cacheline_aligned"))) which is at odds with + __asi_not_sensitive. We should consider instead using + __attribute__ ((__aligned__(XXX))) where XXX is a def for cacheline or + something*/ +/* __cacheline_aligned_in_smp */ +__asi_not_sensitive DEFINE_RAW_SPINLOCK(jiffies_lock); __cacheline_aligned_in_smp seqcount_raw_spinlock_t jiffies_seq = SEQCNT_RAW_SPINLOCK_ZERO(jiffies_seq, &jiffies_lock); diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c index 406dccb79c2b..23711fb94323 100644 --- a/kernel/time/ntp.c +++ b/kernel/time/ntp.c @@ -31,13 +31,13 @@ /* USER_HZ period (usecs): */ -unsigned long tick_usec = USER_TICK_USEC; +unsigned long tick_usec __asi_not_sensitive = USER_TICK_USEC; /* SHIFTED_HZ period (nsecs): */ -unsigned long tick_nsec; +unsigned long tick_nsec __asi_not_sensitive; -static u64 tick_length; -static u64 tick_length_base; +static u64 tick_length __asi_not_sensitive; +static u64 tick_length_base __asi_not_sensitive; #define SECS_PER_DAY 86400 #define MAX_TICKADJ 500LL /* usecs */ @@ -54,36 +54,36 @@ static u64 tick_length_base; * * (TIME_ERROR prevents overwriting the CMOS clock) */ -static int time_state = TIME_OK; +static int time_state __asi_not_sensitive = TIME_OK; /* clock status bits: */ -static int time_status = STA_UNSYNC; +static int time_status __asi_not_sensitive = STA_UNSYNC; /* time adjustment (nsecs): */ -static s64 time_offset; +static s64 time_offset __asi_not_sensitive; /* pll time constant: */ -static long time_constant = 2; +static long time_constant __asi_not_sensitive = 2; /* maximum error (usecs): */ -static long time_maxerror = NTP_PHASE_LIMIT; +static long time_maxerror __asi_not_sensitive = NTP_PHASE_LIMIT; /* estimated error (usecs): */ -static long time_esterror = NTP_PHASE_LIMIT; +static long time_esterror __asi_not_sensitive = NTP_PHASE_LIMIT; /* frequency offset (scaled nsecs/secs): */ -static s64 time_freq; +static s64 time_freq 
__asi_not_sensitive; /* time at last adjustment (secs): */ -static time64_t time_reftime; +static time64_t time_reftime __asi_not_sensitive; -static long time_adjust; +static long time_adjust __asi_not_sensitive; /* constant (boot-param configurable) NTP tick adjustment (upscaled) */ -static s64 ntp_tick_adj; +static s64 ntp_tick_adj __asi_not_sensitive; /* second value of the next pending leapsecond, or TIME64_MAX if no leap */ -static time64_t ntp_next_leap_sec = TIME64_MAX; +static time64_t ntp_next_leap_sec __asi_not_sensitive = TIME64_MAX; #ifdef CONFIG_NTP_PPS diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c index 46789356f856..cbe75661ca74 100644 --- a/kernel/time/tick-common.c +++ b/kernel/time/tick-common.c @@ -31,7 +31,7 @@ DEFINE_PER_CPU(struct tick_device, tick_cpu_device); * CPU which handles the tick and protected by jiffies_lock. There is * no requirement to write hold the jiffies seqcount for it. */ -ktime_t tick_next_period; +ktime_t tick_next_period __asi_not_sensitive; /* * tick_do_timer_cpu is a timer core internal variable which holds the CPU NR @@ -47,7 +47,7 @@ ktime_t tick_next_period; * at it will take over and keep the time keeping alive. The handover * procedure also covers cpu hotplug. */ -int tick_do_timer_cpu __read_mostly = TICK_DO_TIMER_BOOT; +int tick_do_timer_cpu __asi_not_sensitive_readmostly = TICK_DO_TIMER_BOOT; #ifdef CONFIG_NO_HZ_FULL /* * tick_do_timer_boot_cpu indicates the boot CPU temporarily owns diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h index 649f2b48e8f0..ed7e2a18060a 100644 --- a/kernel/time/tick-internal.h +++ b/kernel/time/tick-internal.h @@ -15,7 +15,7 @@ DECLARE_PER_CPU(struct tick_device, tick_cpu_device); extern ktime_t tick_next_period; -extern int tick_do_timer_cpu __read_mostly; +extern int tick_do_timer_cpu; extern void tick_setup_periodic(struct clock_event_device *dev, int broadcast); extern void tick_handle_periodic(struct clock_event_device *dev); diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 17a283ce2b20..c23fecbb68c2 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -49,7 +49,7 @@ struct tick_sched *tick_get_tick_sched(int cpu) * jiffies_lock and jiffies_seq. tick_nohz_next_event() needs to get a * consistent view of jiffies and last_jiffies_update. */ -static ktime_t last_jiffies_update; +static ktime_t last_jiffies_update __asi_not_sensitive; /* * Must be called with interrupts disabled ! 
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index dcdcb85121e4..120395965e45 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -39,7 +39,7 @@ enum timekeeping_adv_mode { TK_ADV_FREQ }; -DEFINE_RAW_SPINLOCK(timekeeper_lock); +__asi_not_sensitive DEFINE_RAW_SPINLOCK(timekeeper_lock); /* * The most important data for readout fits into a single 64 byte @@ -48,14 +48,14 @@ DEFINE_RAW_SPINLOCK(timekeeper_lock); static struct { seqcount_raw_spinlock_t seq; struct timekeeper timekeeper; -} tk_core ____cacheline_aligned = { +} tk_core ____cacheline_aligned __asi_not_sensitive = { .seq = SEQCNT_RAW_SPINLOCK_ZERO(tk_core.seq, &timekeeper_lock), }; -static struct timekeeper shadow_timekeeper; +static struct timekeeper shadow_timekeeper __asi_not_sensitive; /* flag for if timekeeping is suspended */ -int __read_mostly timekeeping_suspended; +int __asi_not_sensitive_readmostly timekeeping_suspended; /** * struct tk_fast - NMI safe timekeeper @@ -72,7 +72,7 @@ struct tk_fast { }; /* Suspend-time cycles value for halted fast timekeeper. */ -static u64 cycles_at_suspend; +static u64 cycles_at_suspend __asi_not_sensitive; static u64 dummy_clock_read(struct clocksource *cs) { diff --git a/kernel/time/timekeeping.h b/kernel/time/timekeeping.h index 543beba096c7..b32ee75808fe 100644 --- a/kernel/time/timekeeping.h +++ b/kernel/time/timekeeping.h @@ -26,7 +26,7 @@ extern void update_process_times(int user); extern void do_timer(unsigned long ticks); extern void update_wall_time(void); -extern raw_spinlock_t jiffies_lock; +extern __asi_not_sensitive raw_spinlock_t jiffies_lock; extern seqcount_raw_spinlock_t jiffies_seq; #define CS_NAME_LEN 32 diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 85f1021ad459..0b09c99b568c 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -56,7 +56,7 @@ #define CREATE_TRACE_POINTS #include -__visible u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES; +u64 jiffies_64 __asi_not_sensitive ____cacheline_aligned = INITIAL_JIFFIES; EXPORT_SYMBOL(jiffies_64); diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 78ea542ce3bc..eaec3814c5a4 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -432,7 +432,7 @@ EXPORT_SYMBOL_GPL(unregister_ftrace_export); * The global_trace is the descriptor that holds the top-level tracing * buffers for the live tracing. */ -static struct trace_array global_trace = { +static struct trace_array global_trace __asi_not_sensitive = { .trace_flags = TRACE_DEFAULT_FLAGS, }; diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c index e304196d7c28..d49db8e2430a 100644 --- a/kernel/trace/trace_sched_switch.c +++ b/kernel/trace/trace_sched_switch.c @@ -16,8 +16,8 @@ #define RECORD_CMDLINE 1 #define RECORD_TGID 2 -static int sched_cmdline_ref; -static int sched_tgid_ref; +static int sched_cmdline_ref __asi_not_sensitive; +static int sched_tgid_ref __asi_not_sensitive; static DEFINE_MUTEX(sched_register_mutex); static void diff --git a/lib/debug_locks.c b/lib/debug_locks.c index a75ee30b77cb..f2d217859be6 100644 --- a/lib/debug_locks.c +++ b/lib/debug_locks.c @@ -14,6 +14,7 @@ #include #include #include +#include /* * We want to turn all lock-debugging facilities on/off at once, @@ -22,7 +23,7 @@ * that would just muddy the log. So we report the first one and * shut up after that. 
*/ -int debug_locks __read_mostly = 1; +int debug_locks __asi_not_sensitive_readmostly = 1; EXPORT_SYMBOL_GPL(debug_locks); /* @@ -30,7 +31,7 @@ EXPORT_SYMBOL_GPL(debug_locks); * 'silent failure': nothing is printed to the console when * a locking bug is detected. */ -int debug_locks_silent __read_mostly; +int debug_locks_silent __asi_not_sensitive_readmostly; EXPORT_SYMBOL_GPL(debug_locks_silent); /* diff --git a/mm/memory.c b/mm/memory.c index 667ece86e051..5aa39d0aba2b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -152,7 +152,7 @@ static int __init disable_randmaps(char *s) } __setup("norandmaps", disable_randmaps); -unsigned long zero_pfn __read_mostly; +unsigned long zero_pfn __asi_not_sensitive; EXPORT_SYMBOL(zero_pfn); unsigned long highest_memmap_pfn __read_mostly; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 998ff6a56732..9c850b8bd1fc 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -183,7 +183,7 @@ unsigned long totalreserve_pages __read_mostly; unsigned long totalcma_pages __read_mostly; int percpu_pagelist_high_fraction; -gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK; +gfp_t gfp_allowed_mask __asi_not_sensitive_readmostly = GFP_BOOT_MASK; DEFINE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, init_on_alloc); EXPORT_SYMBOL(init_on_alloc); diff --git a/mm/sparse.c b/mm/sparse.c index e5c84b0cf0c9..64dcf7fceaed 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -24,10 +24,10 @@ * 1) mem_section - memory sections, mem_map's for valid memory */ #ifdef CONFIG_SPARSEMEM_EXTREME -struct mem_section **mem_section; +struct mem_section **mem_section __asi_not_sensitive; #else struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT] - ____cacheline_internodealigned_in_smp; + ____cacheline_internodealigned_in_smp __asi_not_sensitive; #endif EXPORT_SYMBOL(mem_section); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index e8e9c8588908..0af973b950c2 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -3497,7 +3497,7 @@ static int kvm_vcpu_release(struct inode *inode, struct file *filp) return 0; } -static struct file_operations kvm_vcpu_fops = { +static struct file_operations kvm_vcpu_fops __asi_not_sensitive = { .release = kvm_vcpu_release, .unlocked_ioctl = kvm_vcpu_ioctl, .mmap = kvm_vcpu_mmap, From patchwork Wed Feb 23 05:22:18 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756449
Date: Tue, 22 Feb 2022 21:22:18 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-43-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 42/47] mm: asi: Annotation of PERCPU variables to be nonsensitive From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: Ofir Weisse , kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Ofir Weisse The heart of ASI is to differentiate between sensitive and non-sensitive data access. This commit marks certain static PERCPU variables as not sensitive. Some static variables are accessed frequently and therefore would cause many ASI exits. The frequency of these accesses is monitored by tracing asi_exits and analyzing the accessed addresses. Many of these variables don't contain sensitive information and can therefore be mapped into the global ASI region. This commit changes DEFINE_PER_CPU --> DEFINE_PER_CPU_ASI_NOT_SENSITIVE for variables which are frequently accessed yet not sensitive. The end result is a very significant reduction in ASI exits on real benchmarks.
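To make the annotation mechanism concrete, the sketch below shows one way such markings can be implemented: the attribute drops the variable into a dedicated data section, and that section is mapped into the global nonsensitive address space once during boot, so later accesses never trigger an ASI exit. The section name, linker symbols and init hook are illustrative assumptions, not the definitions used by this series (those live in other patches of the RFC); only asi_map() and ASI_GLOBAL_NONSENSITIVE are used as shown earlier in the series. DEFINE_PER_CPU_ASI_NOT_SENSITIVE follows the same idea for percpu data, using a dedicated percpu subsection.

#include <linux/init.h>
#include <asm/page.h>
#include <asm/asi.h>

/* Sketch only: place annotated variables into one dedicated section. */
#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
#define __asi_not_sensitive	__section(".data..asi_not_sensitive")
#else
#define __asi_not_sensitive
#endif

/* Hypothetical linker-provided bounds of that section. */
extern char __asi_not_sensitive_start[], __asi_not_sensitive_end[];

/*
 * Map the whole section once at boot (registration as an initcall omitted).
 * The alignment mirrors the pcpu_map_asi_reserved_chunk() pattern earlier in
 * the series, since asi_map() expects page-aligned addresses.
 */
static int __init asi_map_not_sensitive_section(void)
{
	unsigned long start = (unsigned long)__asi_not_sensitive_start & PAGE_MASK;
	unsigned long end = PAGE_ALIGN((unsigned long)__asi_not_sensitive_end);

	return asi_map(ASI_GLOBAL_NONSENSITIVE, (void *)start, end - start);
}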
Signed-off-by: Ofir Weisse --- arch/x86/events/core.c | 2 +- arch/x86/events/intel/bts.c | 2 +- arch/x86/events/perf_event.h | 2 +- arch/x86/include/asm/asi.h | 2 +- arch/x86/include/asm/current.h | 2 +- arch/x86/include/asm/debugreg.h | 2 +- arch/x86/include/asm/desc.h | 2 +- arch/x86/include/asm/fpu/api.h | 2 +- arch/x86/include/asm/hardirq.h | 2 +- arch/x86/include/asm/hw_irq.h | 2 +- arch/x86/include/asm/percpu.h | 2 +- arch/x86/include/asm/preempt.h | 2 +- arch/x86/include/asm/processor.h | 12 ++++++------ arch/x86/include/asm/smp.h | 2 +- arch/x86/include/asm/tlbflush.h | 4 ++-- arch/x86/include/asm/topology.h | 2 +- arch/x86/kernel/apic/apic.c | 2 +- arch/x86/kernel/apic/x2apic_cluster.c | 6 +++--- arch/x86/kernel/cpu/common.c | 12 ++++++------ arch/x86/kernel/fpu/core.c | 2 +- arch/x86/kernel/hw_breakpoint.c | 2 +- arch/x86/kernel/irq.c | 2 +- arch/x86/kernel/irqinit.c | 2 +- arch/x86/kernel/nmi.c | 6 +++--- arch/x86/kernel/process.c | 4 ++-- arch/x86/kernel/setup_percpu.c | 4 ++-- arch/x86/kernel/smpboot.c | 3 ++- arch/x86/kernel/tsc.c | 2 +- arch/x86/kvm/x86.c | 2 +- arch/x86/kvm/x86.h | 2 +- arch/x86/mm/asi.c | 2 +- arch/x86/mm/init.c | 2 +- arch/x86/mm/tlb.c | 2 +- include/asm-generic/irq_regs.h | 2 +- include/linux/arch_topology.h | 2 +- include/linux/hrtimer.h | 2 +- include/linux/interrupt.h | 2 +- include/linux/kernel_stat.h | 4 ++-- include/linux/prandom.h | 2 +- kernel/events/core.c | 6 +++--- kernel/irq_work.c | 6 +++--- kernel/rcu/tree.c | 2 +- kernel/sched/core.c | 6 +++--- kernel/sched/cpufreq.c | 3 ++- kernel/sched/cputime.c | 2 +- kernel/sched/sched.h | 21 +++++++++++---------- kernel/sched/topology.c | 14 +++++++------- kernel/smp.c | 7 ++++--- kernel/softirq.c | 2 +- kernel/time/hrtimer.c | 2 +- kernel/time/tick-common.c | 2 +- kernel/time/tick-internal.h | 4 ++-- kernel/time/tick-sched.c | 2 +- kernel/time/timer.c | 2 +- kernel/trace/trace.c | 2 +- kernel/trace/trace_preemptirq.c | 2 +- kernel/watchdog.c | 12 ++++++------ lib/irq_regs.c | 2 +- lib/random32.c | 3 ++- virt/kvm/kvm_main.c | 2 +- 60 files changed, 112 insertions(+), 107 deletions(-) diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index db825bf053fd..2d9829d774d7 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -47,7 +47,7 @@ struct x86_pmu x86_pmu __asi_not_sensitive_readmostly; static struct pmu pmu; -DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = { +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct cpu_hw_events, cpu_hw_events) = { .enabled = 1, .pmu = &pmu, }; diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c index 974e917e65b2..06d9de514b0d 100644 --- a/arch/x86/events/intel/bts.c +++ b/arch/x86/events/intel/bts.c @@ -36,7 +36,7 @@ enum { BTS_STATE_ACTIVE, }; -static DEFINE_PER_CPU(struct bts_ctx, bts_ctx); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct bts_ctx, bts_ctx); #define BTS_RECORD_SIZE 24 #define BTS_SAFETY_MARGIN 4080 diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index 27cca7fd6f17..9a4855e6ffa6 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -1036,7 +1036,7 @@ static inline bool x86_pmu_has_lbr_callstack(void) x86_pmu.lbr_sel_map[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT] > 0; } -DECLARE_PER_CPU(struct cpu_hw_events, cpu_hw_events); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct cpu_hw_events, cpu_hw_events); int x86_perf_event_set_period(struct perf_event *event); diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index d43f6aadffee..6148e65fb0c2 100644 --- 
a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -52,7 +52,7 @@ struct asi_pgtbl_pool { uint count; }; -DECLARE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state); +DECLARE_PER_CPU_ALIGNED_ASI_NOT_SENSITIVE(struct asi_state, asi_cpu_state); extern pgd_t asi_global_nonsensitive_pgd[]; diff --git a/arch/x86/include/asm/current.h b/arch/x86/include/asm/current.h index 3e204e6140b5..a4bcf1f305bf 100644 --- a/arch/x86/include/asm/current.h +++ b/arch/x86/include/asm/current.h @@ -8,7 +8,7 @@ #ifndef __ASSEMBLY__ struct task_struct; -DECLARE_PER_CPU(struct task_struct *, current_task); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct task_struct *, current_task); static __always_inline struct task_struct *get_current(void) { diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h index cfdf307ddc01..fa67db27b098 100644 --- a/arch/x86/include/asm/debugreg.h +++ b/arch/x86/include/asm/debugreg.h @@ -6,7 +6,7 @@ #include #include -DECLARE_PER_CPU(unsigned long, cpu_dr7); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, cpu_dr7); #ifndef CONFIG_PARAVIRT_XXL /* diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h index ab97b22ac04a..7d9fff8c9543 100644 --- a/arch/x86/include/asm/desc.h +++ b/arch/x86/include/asm/desc.h @@ -298,7 +298,7 @@ static inline void native_load_tls(struct thread_struct *t, unsigned int cpu) gdt[GDT_ENTRY_TLS_MIN + i] = t->tls_array[i]; } -DECLARE_PER_CPU(bool, __tss_limit_invalid); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(bool, __tss_limit_invalid); static inline void force_reload_TR(void) { diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h index 6f5ca3c2ef4a..15abb1b05fbc 100644 --- a/arch/x86/include/asm/fpu/api.h +++ b/arch/x86/include/asm/fpu/api.h @@ -121,7 +121,7 @@ static inline void fpstate_init_soft(struct swregs_state *soft) {} #endif /* State tracking */ -DECLARE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct fpu *, fpu_fpregs_owner_ctx); /* Process cleanup */ #ifdef CONFIG_X86_64 diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h index 275e7fd20310..2f70deca4a20 100644 --- a/arch/x86/include/asm/hardirq.h +++ b/arch/x86/include/asm/hardirq.h @@ -46,7 +46,7 @@ typedef struct { #endif } ____cacheline_aligned irq_cpustat_t; -DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat); +DECLARE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(irq_cpustat_t, irq_stat); #define __ARCH_IRQ_STAT diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h index d465ece58151..e561abfce735 100644 --- a/arch/x86/include/asm/hw_irq.h +++ b/arch/x86/include/asm/hw_irq.h @@ -128,7 +128,7 @@ extern char spurious_entries_start[]; #define VECTOR_RETRIGGERED ((void *)-2L) typedef struct irq_desc* vector_irq_t[NR_VECTORS]; -DECLARE_PER_CPU(vector_irq_t, vector_irq); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(vector_irq_t, vector_irq); #endif /* !ASSEMBLY_ */ diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h index a3c33b79fb86..f9486bbe8a76 100644 --- a/arch/x86/include/asm/percpu.h +++ b/arch/x86/include/asm/percpu.h @@ -390,7 +390,7 @@ static inline bool x86_this_cpu_variable_test_bit(int nr, #include /* We can use this directly for local CPU (faster). 
*/ -DECLARE_PER_CPU_READ_MOSTLY(unsigned long, this_cpu_off); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, this_cpu_off); #endif /* !__ASSEMBLY__ */ diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h index fe5efbcba824..204a8532b870 100644 --- a/arch/x86/include/asm/preempt.h +++ b/arch/x86/include/asm/preempt.h @@ -7,7 +7,7 @@ #include #include -DECLARE_PER_CPU(int, __preempt_count); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(int, __preempt_count); /* We use the MSB mostly because its available */ #define PREEMPT_NEED_RESCHED 0x80000000 diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 20116efd2756..63831f9a503b 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -417,14 +417,14 @@ struct tss_struct { struct x86_io_bitmap io_bitmap; } __aligned(PAGE_SIZE); -DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss_rw); +DECLARE_PER_CPU_PAGE_ALIGNED_ASI_NOT_SENSITIVE(struct tss_struct, cpu_tss_rw); /* Per CPU interrupt stacks */ struct irq_stack { char stack[IRQ_STACK_SIZE]; } __aligned(IRQ_STACK_SIZE); -DECLARE_PER_CPU(unsigned long, cpu_current_top_of_stack); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, cpu_current_top_of_stack); #ifdef CONFIG_X86_64 struct fixed_percpu_data { @@ -448,8 +448,8 @@ static inline unsigned long cpu_kernelmode_gs_base(int cpu) return (unsigned long)per_cpu(fixed_percpu_data.gs_base, cpu); } -DECLARE_PER_CPU(void *, hardirq_stack_ptr); -DECLARE_PER_CPU(bool, hardirq_stack_inuse); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(void *, hardirq_stack_ptr); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(bool, hardirq_stack_inuse); extern asmlinkage void ignore_sysret(void); /* Save actual FS/GS selectors and bases to current->thread */ @@ -458,8 +458,8 @@ void current_save_fsgs(void); #ifdef CONFIG_STACKPROTECTOR DECLARE_PER_CPU(unsigned long, __stack_chk_guard); #endif -DECLARE_PER_CPU(struct irq_stack *, hardirq_stack_ptr); -DECLARE_PER_CPU(struct irq_stack *, softirq_stack_ptr); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct irq_stack *, hardirq_stack_ptr); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct irq_stack *, softirq_stack_ptr); #endif /* !X86_64 */ struct perf_event; diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h index 81a0211a372d..8d85a918532e 100644 --- a/arch/x86/include/asm/smp.h +++ b/arch/x86/include/asm/smp.h @@ -19,7 +19,7 @@ DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map); DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map); DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id); DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_l2c_id); -DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(int, cpu_number); static inline struct cpumask *cpu_llc_shared_mask(int cpu) { diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 7d04aa2a5f86..adcdeb58d817 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -151,7 +151,7 @@ struct tlb_state { */ struct tlb_context ctxs[TLB_NR_DYN_ASIDS]; }; -DECLARE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate); +DECLARE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(struct tlb_state, cpu_tlbstate); struct tlb_state_shared { /* @@ -171,7 +171,7 @@ struct tlb_state_shared { */ bool is_lazy; }; -DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared); +DECLARE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(struct tlb_state_shared, cpu_tlbstate_shared); bool nmi_uaccess_okay(void); #define nmi_uaccess_okay nmi_uaccess_okay diff 
--git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index cc164777e661..bff1a9123469 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -203,7 +203,7 @@ DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key); #define arch_scale_freq_invariant() static_branch_likely(&arch_scale_freq_key) -DECLARE_PER_CPU(unsigned long, arch_freq_scale); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, arch_freq_scale); static inline long arch_scale_freq_capacity(int cpu) { diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index b70344bf6600..5fa0ce0ecfb3 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -548,7 +548,7 @@ static struct clock_event_device lapic_clockevent = { .rating = 100, .irq = -1, }; -static DEFINE_PER_CPU(struct clock_event_device, lapic_events); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct clock_event_device, lapic_events); static const struct x86_cpu_id deadline_match[] __initconst = { X86_MATCH_INTEL_FAM6_MODEL_STEPPINGS(HASWELL_X, X86_STEPPINGS(0x2, 0x2), 0x3a), /* EP */ diff --git a/arch/x86/kernel/apic/x2apic_cluster.c b/arch/x86/kernel/apic/x2apic_cluster.c index e696e22d0531..655fe820a240 100644 --- a/arch/x86/kernel/apic/x2apic_cluster.c +++ b/arch/x86/kernel/apic/x2apic_cluster.c @@ -20,10 +20,10 @@ struct cluster_mask { * x86_cpu_to_logical_apicid for all online cpus in a sequential way. * Using per cpu variable would cost one cache line per cpu. */ -static u32 *x86_cpu_to_logical_apicid __read_mostly; +static u32 *x86_cpu_to_logical_apicid __asi_not_sensitive_readmostly; -static DEFINE_PER_CPU(cpumask_var_t, ipi_mask); -static DEFINE_PER_CPU_READ_MOSTLY(struct cluster_mask *, cluster_masks); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(cpumask_var_t, ipi_mask); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct cluster_mask *, cluster_masks); static struct cluster_mask *cluster_hotplug_mask; static int x2apic_acpi_madt_oem_check(char *oem_id, char *oem_table_id) diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 0083464de5e3..471b3a42db64 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1775,17 +1775,17 @@ EXPORT_PER_CPU_SYMBOL_GPL(fixed_percpu_data); * The following percpu variables are hot. Align current_task to * cacheline size such that they fall in the same cacheline. 
*/ -DEFINE_PER_CPU(struct task_struct *, current_task) ____cacheline_aligned = +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct task_struct *, current_task) ____cacheline_aligned = &init_task; EXPORT_PER_CPU_SYMBOL(current_task); -DEFINE_PER_CPU(void *, hardirq_stack_ptr); -DEFINE_PER_CPU(bool, hardirq_stack_inuse); +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(void *, hardirq_stack_ptr); +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(bool, hardirq_stack_inuse); -DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT; +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(int, __preempt_count) = INIT_PREEMPT_COUNT; EXPORT_PER_CPU_SYMBOL(__preempt_count); -DEFINE_PER_CPU(unsigned long, cpu_current_top_of_stack) = TOP_OF_INIT_STACK; +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, cpu_current_top_of_stack) = TOP_OF_INIT_STACK; /* May not be marked __init: used by software suspend */ void syscall_init(void) @@ -1826,7 +1826,7 @@ void syscall_init(void) #else /* CONFIG_X86_64 */ -DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task; +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct task_struct *, current_task) = &init_task; EXPORT_PER_CPU_SYMBOL(current_task); DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT; EXPORT_PER_CPU_SYMBOL(__preempt_count); diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c index d7859573973d..b59317c5721f 100644 --- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c @@ -57,7 +57,7 @@ static DEFINE_PER_CPU(bool, in_kernel_fpu); /* * Track which context is using the FPU on the CPU: */ -DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx); +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct fpu *, fpu_fpregs_owner_ctx); struct kmem_cache *fpstate_cachep; diff --git a/arch/x86/kernel/hw_breakpoint.c b/arch/x86/kernel/hw_breakpoint.c index 668a4a6533d9..c2ceea8f6801 100644 --- a/arch/x86/kernel/hw_breakpoint.c +++ b/arch/x86/kernel/hw_breakpoint.c @@ -36,7 +36,7 @@ #include /* Per cpu debug control register value */ -DEFINE_PER_CPU(unsigned long, cpu_dr7); +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, cpu_dr7); EXPORT_PER_CPU_SYMBOL(cpu_dr7); /* Per cpu debug address registers values */ diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c index 766ffe3ba313..5c5aa75050a5 100644 --- a/arch/x86/kernel/irq.c +++ b/arch/x86/kernel/irq.c @@ -26,7 +26,7 @@ #define CREATE_TRACE_POINTS #include -DEFINE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat); +DEFINE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(irq_cpustat_t, irq_stat); EXPORT_PER_CPU_SYMBOL(irq_stat); atomic_t irq_err_count; diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c index beb1bada1b0a..d7893e040695 100644 --- a/arch/x86/kernel/irqinit.c +++ b/arch/x86/kernel/irqinit.c @@ -46,7 +46,7 @@ * (these are usually mapped into the 0x30-0xff vector range) */ -DEFINE_PER_CPU(vector_irq_t, vector_irq) = { +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(vector_irq_t, vector_irq) = { [0 ... 
NR_VECTORS - 1] = VECTOR_UNUSED, }; diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index 4bce802d25fb..ef95071228ca 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -469,9 +469,9 @@ enum nmi_states { NMI_EXECUTING, NMI_LATCHED, }; -static DEFINE_PER_CPU(enum nmi_states, nmi_state); -static DEFINE_PER_CPU(unsigned long, nmi_cr2); -static DEFINE_PER_CPU(unsigned long, nmi_dr7); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(enum nmi_states, nmi_state); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, nmi_cr2); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, nmi_dr7); DEFINE_IDTENTRY_RAW(exc_nmi) { diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index f9bd1c3415d4..e4a32490dda0 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -56,7 +56,7 @@ * section. Since TSS's are completely CPU-local, we want them * on exact cacheline boundaries, to eliminate cacheline ping-pong. */ -__visible DEFINE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss_rw) = { +__visible DEFINE_PER_CPU_PAGE_ALIGNED_ASI_NOT_SENSITIVE(struct tss_struct, cpu_tss_rw) = { .x86_tss = { /* * .sp0 is only used when entering ring 0 from a lower @@ -77,7 +77,7 @@ __visible DEFINE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss_rw) = { }; EXPORT_PER_CPU_SYMBOL(cpu_tss_rw); -DEFINE_PER_CPU(bool, __tss_limit_invalid); +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(bool, __tss_limit_invalid); EXPORT_PER_CPU_SYMBOL_GPL(__tss_limit_invalid); void __init arch_task_cache_init(void) diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c index 7b65275544b2..13c94a512b7e 100644 --- a/arch/x86/kernel/setup_percpu.c +++ b/arch/x86/kernel/setup_percpu.c @@ -23,7 +23,7 @@ #include #include -DEFINE_PER_CPU_READ_MOSTLY(int, cpu_number); +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(int, cpu_number); EXPORT_PER_CPU_SYMBOL(cpu_number); #ifdef CONFIG_X86_64 @@ -32,7 +32,7 @@ EXPORT_PER_CPU_SYMBOL(cpu_number); #define BOOT_PERCPU_OFFSET 0 #endif -DEFINE_PER_CPU_READ_MOSTLY(unsigned long, this_cpu_off) = BOOT_PERCPU_OFFSET; +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, this_cpu_off) = BOOT_PERCPU_OFFSET; EXPORT_PER_CPU_SYMBOL(this_cpu_off); unsigned long __per_cpu_offset[NR_CPUS] __ro_after_init = { diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 617012f4619f..0cfc4fdc2476 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -2224,7 +2224,8 @@ static void disable_freq_invariance_workfn(struct work_struct *work) static DECLARE_WORK(disable_freq_invariance_work, disable_freq_invariance_workfn); -DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE; +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, arch_freq_scale) = + SCHED_CAPACITY_SCALE; void arch_scale_freq_tick(void) { diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index d7169da99b01..39c441409dec 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -59,7 +59,7 @@ struct cyc2ns { }; /* fits one cacheline */ -static DEFINE_PER_CPU_ALIGNED(struct cyc2ns, cyc2ns); +static DEFINE_PER_CPU_ALIGNED_ASI_NOT_SENSITIVE(struct cyc2ns, cyc2ns); static int __init tsc_early_khz_setup(char *buf) { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 0df88eadab60..451872d178e5 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8523,7 +8523,7 @@ static void kvm_timer_init(void) kvmclock_cpu_online, kvmclock_cpu_down_prep); } -DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu); +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct kvm_vcpu *, current_vcpu); 
EXPORT_PER_CPU_SYMBOL_GPL(current_vcpu); int kvm_is_in_guest(void) diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 4abcd8d9836d..3d5da4daaf53 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -392,7 +392,7 @@ static inline bool kvm_cstate_in_guest(struct kvm *kvm) return kvm->arch.cstate_in_guest; } -DECLARE_PER_CPU(struct kvm_vcpu *, current_vcpu); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct kvm_vcpu *, current_vcpu); static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu) { diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index fdc117929fc7..04628949e89d 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -20,7 +20,7 @@ static struct asi_class asi_class[ASI_MAX_NUM] __asi_not_sensitive; static DEFINE_SPINLOCK(asi_class_lock __asi_not_sensitive); -DEFINE_PER_CPU_ALIGNED(struct asi_state, asi_cpu_state); +DEFINE_PER_CPU_ALIGNED_ASI_NOT_SENSITIVE(struct asi_state, asi_cpu_state); EXPORT_PER_CPU_SYMBOL_GPL(asi_cpu_state); __aligned(PAGE_SIZE) pgd_t asi_global_nonsensitive_pgd[PTRS_PER_PGD]; diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c index dfff17363365..012631d03c4f 100644 --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -1025,7 +1025,7 @@ void __init zone_sizes_init(void) free_area_init(max_zone_pfns); } -__visible DEFINE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate) = { +__visible DEFINE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(struct tlb_state, cpu_tlbstate) = { .loaded_mm = &init_mm, .next_asid = 1, .cr4 = ~0UL, /* fail hard if we screw up cr4 shadow initialization */ diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index fcd2c8e92f83..36d41356ed04 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -972,7 +972,7 @@ static bool tlb_is_not_lazy(int cpu) static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask); -DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared); +DEFINE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(struct tlb_state_shared, cpu_tlbstate_shared); EXPORT_PER_CPU_SYMBOL(cpu_tlbstate_shared); STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask, diff --git a/include/asm-generic/irq_regs.h b/include/asm-generic/irq_regs.h index 2e7c6e89d42e..3225bdb2aefa 100644 --- a/include/asm-generic/irq_regs.h +++ b/include/asm-generic/irq_regs.h @@ -14,7 +14,7 @@ * Per-cpu current frame pointer - the location of the last exception frame on * the stack */ -DECLARE_PER_CPU(struct pt_regs *, __irq_regs); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct pt_regs *, __irq_regs); static inline struct pt_regs *get_irq_regs(void) { diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h index b97cea83b25e..35fdf256777a 100644 --- a/include/linux/arch_topology.h +++ b/include/linux/arch_topology.h @@ -23,7 +23,7 @@ static inline unsigned long topology_get_cpu_scale(int cpu) void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity); -DECLARE_PER_CPU(unsigned long, arch_freq_scale); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, arch_freq_scale); static inline unsigned long topology_get_freq_scale(int cpu) { diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h index 0ee140176f10..68b2f10aaa46 100644 --- a/include/linux/hrtimer.h +++ b/include/linux/hrtimer.h @@ -355,7 +355,7 @@ static inline void timerfd_clock_was_set(void) { } static inline void timerfd_resume(void) { } #endif -DECLARE_PER_CPU(struct tick_device, tick_cpu_device); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct tick_device, tick_cpu_device); #ifdef CONFIG_PREEMPT_RT void hrtimer_cancel_wait_running(const struct hrtimer 
*timer); diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index 1f22a30c0963..6ae485d2ebb3 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -554,7 +554,7 @@ extern void __raise_softirq_irqoff(unsigned int nr); extern void raise_softirq_irqoff(unsigned int nr); extern void raise_softirq(unsigned int nr); -DECLARE_PER_CPU(struct task_struct *, ksoftirqd); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct task_struct *, ksoftirqd); static inline struct task_struct *this_cpu_ksoftirqd(void) { diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index 69ae6b278464..89609dc5d30f 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -40,8 +40,8 @@ struct kernel_stat { unsigned int softirqs[NR_SOFTIRQS]; }; -DECLARE_PER_CPU(struct kernel_stat, kstat); -DECLARE_PER_CPU(struct kernel_cpustat, kernel_cpustat); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct kernel_stat, kstat); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct kernel_cpustat, kernel_cpustat); /* Must have preemption disabled for this to be meaningful. */ #define kstat_this_cpu this_cpu_ptr(&kstat) diff --git a/include/linux/prandom.h b/include/linux/prandom.h index 056d31317e49..f02392ca6dc2 100644 --- a/include/linux/prandom.h +++ b/include/linux/prandom.h @@ -16,7 +16,7 @@ void prandom_bytes(void *buf, size_t nbytes); void prandom_seed(u32 seed); void prandom_reseed_late(void); -DECLARE_PER_CPU(unsigned long, net_rand_noise); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, net_rand_noise); #define PRANDOM_ADD_NOISE(a, b, c, d) \ prandom_u32_add_noise((unsigned long)(a), (unsigned long)(b), \ diff --git a/kernel/events/core.c b/kernel/events/core.c index 6ea559b6e0f4..1914cc538cab 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -1207,7 +1207,7 @@ void perf_pmu_enable(struct pmu *pmu) pmu->pmu_enable(pmu); } -static DEFINE_PER_CPU(struct list_head, active_ctx_list); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct list_head, active_ctx_list); /* * perf_event_ctx_activate(), perf_event_ctx_deactivate(), and @@ -4007,8 +4007,8 @@ do { \ return div64_u64(dividend, divisor); } -static DEFINE_PER_CPU(int, perf_throttled_count); -static DEFINE_PER_CPU(u64, perf_throttled_seq); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(int, perf_throttled_count); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(u64, perf_throttled_seq); static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bool disable) { diff --git a/kernel/irq_work.c b/kernel/irq_work.c index f7df715ec28e..10df3577c733 100644 --- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -22,9 +22,9 @@ #include #include -static DEFINE_PER_CPU(struct llist_head, raised_list); -static DEFINE_PER_CPU(struct llist_head, lazy_list); -static DEFINE_PER_CPU(struct task_struct *, irq_workd); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct llist_head, raised_list); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct llist_head, lazy_list); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct task_struct *, irq_workd); static void wake_irq_workd(void) { diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 284d2722cf0c..aee2b6994bc2 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -74,7 +74,7 @@ /* Data structures. 
*/ -static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = { +static DEFINE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(struct rcu_data, rcu_data) = { .dynticks_nesting = 1, .dynticks_nmi_nesting = DYNTICK_IRQ_NONIDLE, .dynticks = ATOMIC_INIT(1), diff --git a/kernel/sched/core.c b/kernel/sched/core.c index e1c08ff4130e..7c96f0001c7f 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -43,7 +43,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_cfs_tp); EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp); EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp); -DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DEFINE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(struct rq, runqueues); #ifdef CONFIG_SCHED_DEBUG /* @@ -5104,8 +5104,8 @@ void sched_exec(void) #endif -DEFINE_PER_CPU(struct kernel_stat, kstat); -DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat); +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct kernel_stat, kstat); +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct kernel_cpustat, kernel_cpustat); EXPORT_PER_CPU_SYMBOL(kstat); EXPORT_PER_CPU_SYMBOL(kernel_cpustat); diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c index 7c2fe50fd76d..c55a47f8e963 100644 --- a/kernel/sched/cpufreq.c +++ b/kernel/sched/cpufreq.c @@ -9,7 +9,8 @@ #include "sched.h" -DEFINE_PER_CPU(struct update_util_data __rcu *, cpufreq_update_util_data); +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct update_util_data __rcu *, + cpufreq_update_util_data); /** * cpufreq_add_update_util_hook - Populate the CPU's update_util_data pointer. diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 623b5feb142a..d3ad13308889 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -17,7 +17,7 @@ * task when irq is in progress while we read rq->clock. That is a worthy * compromise in place of having locks on each irq in account_system_time. 
*/ -DEFINE_PER_CPU(struct irqtime, cpu_irqtime); +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct irqtime, cpu_irqtime); static int __asi_not_sensitive sched_clock_irqtime; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 517c70a29a57..4188c1a570db 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1360,7 +1360,7 @@ static inline void update_idle_core(struct rq *rq) static inline void update_idle_core(struct rq *rq) { } #endif -DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DECLARE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(struct rq, runqueues); #define cpu_rq(cpu) (&per_cpu(runqueues, (cpu))) #define this_rq() this_cpu_ptr(&runqueues) @@ -1760,13 +1760,13 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) return sd; } -DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); -DECLARE_PER_CPU(int, sd_llc_size); -DECLARE_PER_CPU(int, sd_llc_id); -DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); -DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa); -DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); -DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct sched_domain __rcu *, sd_llc); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(int, sd_llc_size); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(int, sd_llc_id); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct sched_domain_shared __rcu *, sd_llc_shared); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct sched_domain __rcu *, sd_numa); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct sched_domain __rcu *, sd_asym_packing); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct sched_domain __rcu *, sd_asym_cpucapacity); extern struct static_key_false sched_asym_cpucapacity; struct sched_group_capacity { @@ -2753,7 +2753,7 @@ struct irqtime { struct u64_stats_sync sync; }; -DECLARE_PER_CPU(struct irqtime, cpu_irqtime); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct irqtime, cpu_irqtime); /* * Returns the irqtime minus the softirq time computed by ksoftirqd. @@ -2776,7 +2776,8 @@ static inline u64 irq_time_read(int cpu) #endif /* CONFIG_IRQ_TIME_ACCOUNTING */ #ifdef CONFIG_CPU_FREQ -DECLARE_PER_CPU(struct update_util_data __rcu *, cpufreq_update_util_data); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct update_util_data __rcu *, + cpufreq_update_util_data); /** * cpufreq_update_util - Take a note about CPU utilization changes. diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index d201a7052a29..1dcea6a6133e 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -641,13 +641,13 @@ static void destroy_sched_domains(struct sched_domain *sd) * the cpumask of the domain), this allows us to quickly tell if * two CPUs are in the same cache domain, see cpus_share_cache(). 
*/ -DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); -DEFINE_PER_CPU(int, sd_llc_size); -DEFINE_PER_CPU(int, sd_llc_id); -DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); -DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); -DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); -DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct sched_domain __rcu *, sd_llc); +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(int, sd_llc_size); +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(int, sd_llc_id); +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct sched_domain_shared __rcu *, sd_llc_shared); +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct sched_domain __rcu *, sd_numa); +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct sched_domain __rcu *, sd_asym_packing); +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct sched_domain __rcu *, sd_asym_cpucapacity); DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity); static void update_top_cache_domain(int cpu) diff --git a/kernel/smp.c b/kernel/smp.c index c51fd981a4a9..3c1b328f0a09 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -92,9 +92,10 @@ struct call_function_data { cpumask_var_t cpumask_ipi; }; -static DEFINE_PER_CPU_ALIGNED(struct call_function_data, cfd_data); +static DEFINE_PER_CPU_ALIGNED_ASI_NOT_SENSITIVE(struct call_function_data, cfd_data); -static DEFINE_PER_CPU_SHARED_ALIGNED(struct llist_head, call_single_queue); +static DEFINE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(struct llist_head, + call_single_queue); static void flush_smp_call_function_queue(bool warn_cpu_offline); @@ -464,7 +465,7 @@ static __always_inline void csd_unlock(struct __call_single_data *csd) smp_store_release(&csd->node.u_flags, 0); } -static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data); +static DEFINE_PER_CPU_SHARED_ALIGNED_ASI_NOT_SENSITIVE(call_single_data_t, csd_data); void __smp_call_single_queue(int cpu, struct llist_node *node) { diff --git a/kernel/softirq.c b/kernel/softirq.c index c462b7fab4d3..d2660a59feab 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -59,7 +59,7 @@ EXPORT_PER_CPU_SYMBOL(irq_stat); static struct softirq_action softirq_vec[NR_SOFTIRQS] __asi_not_sensitive ____cacheline_aligned; -DEFINE_PER_CPU(struct task_struct *, ksoftirqd); +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct task_struct *, ksoftirqd); const char * const softirq_to_name[NR_SOFTIRQS] = { "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "IRQ_POLL", diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index 8b176f5c01f2..74cfc89a17c4 100644 --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -65,7 +65,7 @@ * to reach a base using a clockid, hrtimer_clockid_to_base() * is used to convert from clockid to the proper hrtimer_base_type. */ -DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) = +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct hrtimer_cpu_base, hrtimer_bases) = { .lock = __RAW_SPIN_LOCK_UNLOCKED(hrtimer_bases.lock), .clock_base = diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c index cbe75661ca74..67180cb44394 100644 --- a/kernel/time/tick-common.c +++ b/kernel/time/tick-common.c @@ -25,7 +25,7 @@ /* * Tick devices */ -DEFINE_PER_CPU(struct tick_device, tick_cpu_device); +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct tick_device, tick_cpu_device); /* * Tick next event: keeps track of the tick time. It's updated by the * CPU which handles the tick and protected by jiffies_lock. 
There is diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h index ed7e2a18060a..6961318d41b7 100644 --- a/kernel/time/tick-internal.h +++ b/kernel/time/tick-internal.h @@ -13,7 +13,7 @@ # define TICK_DO_TIMER_NONE -1 # define TICK_DO_TIMER_BOOT -2 -DECLARE_PER_CPU(struct tick_device, tick_cpu_device); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct tick_device, tick_cpu_device); extern ktime_t tick_next_period; extern int tick_do_timer_cpu; @@ -161,7 +161,7 @@ static inline void timers_update_nohz(void) { } #define tick_nohz_active (0) #endif -DECLARE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases); +DECLARE_PER_CPU_ASI_NOT_SENSITIVE(struct hrtimer_cpu_base, hrtimer_bases); extern u64 get_next_timer_interrupt(unsigned long basej, u64 basem); void timer_clear_idle(void); diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index c23fecbb68c2..afd393b85577 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -36,7 +36,7 @@ /* * Per-CPU nohz control structure */ -static DEFINE_PER_CPU(struct tick_sched, tick_cpu_sched); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct tick_sched, tick_cpu_sched); struct tick_sched *tick_get_tick_sched(int cpu) { diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 0b09c99b568c..9567df187420 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -212,7 +212,7 @@ struct timer_base { struct hlist_head vectors[WHEEL_SIZE]; } ____cacheline_aligned; -static DEFINE_PER_CPU(struct timer_base, timer_bases[NR_BASES]); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct timer_base, timer_bases[NR_BASES]); #ifdef CONFIG_NO_HZ_COMMON diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index eaec3814c5a4..b82f478caf4e 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -106,7 +106,7 @@ dummy_set_flag(struct trace_array *tr, u32 old_flags, u32 bit, int set) * tracing is active, only save the comm when a trace event * occurred. */ -static DEFINE_PER_CPU(bool, trace_taskinfo_save); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(bool, trace_taskinfo_save); /* * Kill all tracing for good (never come back). diff --git a/kernel/trace/trace_preemptirq.c b/kernel/trace/trace_preemptirq.c index f4938040c228..177de3501677 100644 --- a/kernel/trace/trace_preemptirq.c +++ b/kernel/trace/trace_preemptirq.c @@ -17,7 +17,7 @@ #ifdef CONFIG_TRACE_IRQFLAGS /* Per-cpu variable to prevent redundant calls when IRQs already off */ -static DEFINE_PER_CPU(int, tracing_irq_cpu); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(int, tracing_irq_cpu); /* * Like trace_hardirqs_on() but without the lockdep invocation. This is diff --git a/kernel/watchdog.c b/kernel/watchdog.c index ad912511a0c0..c2bf55024202 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -174,13 +174,13 @@ static bool softlockup_initialized __read_mostly; static u64 __read_mostly sample_period; /* Timestamp taken after the last successful reschedule. */ -static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, watchdog_touch_ts); /* Timestamp of the last softlockup report. 
*/ -static DEFINE_PER_CPU(unsigned long, watchdog_report_ts); -static DEFINE_PER_CPU(struct hrtimer, watchdog_hrtimer); -static DEFINE_PER_CPU(bool, softlockup_touch_sync); -static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts); -static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, watchdog_report_ts); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct hrtimer, watchdog_hrtimer); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(bool, softlockup_touch_sync); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, hrtimer_interrupts); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, hrtimer_interrupts_saved); static unsigned long soft_lockup_nmi_warn; static int __init nowatchdog_setup(char *str) diff --git a/lib/irq_regs.c b/lib/irq_regs.c index 0d545a93070e..8b3c6be06a7a 100644 --- a/lib/irq_regs.c +++ b/lib/irq_regs.c @@ -9,6 +9,6 @@ #include #ifndef ARCH_HAS_OWN_IRQ_REGS -DEFINE_PER_CPU(struct pt_regs *, __irq_regs); +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct pt_regs *, __irq_regs); EXPORT_PER_CPU_SYMBOL(__irq_regs); #endif diff --git a/lib/random32.c b/lib/random32.c index a57a0e18819d..e4c1cb1a70b4 100644 --- a/lib/random32.c +++ b/lib/random32.c @@ -339,7 +339,8 @@ struct siprand_state { }; static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy; -DEFINE_PER_CPU(unsigned long, net_rand_noise); +/* TODO(oweisse): Is this entropy sensitive?? */ +DEFINE_PER_CPU_ASI_NOT_SENSITIVE(unsigned long, net_rand_noise); EXPORT_PER_CPU_SYMBOL(net_rand_noise); /* diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 0af973b950c2..8d2d76de5bd0 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -110,7 +110,7 @@ static atomic_t hardware_enable_failed; static struct kmem_cache *kvm_vcpu_cache; static __read_mostly struct preempt_ops kvm_preempt_ops; -static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_running_vcpu); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct kvm_vcpu *, kvm_running_vcpu); struct dentry *kvm_debugfs_dir; EXPORT_SYMBOL_GPL(kvm_debugfs_dir); From patchwork Wed Feb 23 05:22:19 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756450 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9923EC433EF for ; Wed, 23 Feb 2022 05:28:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238387AbiBWF2n (ORCPT ); Wed, 23 Feb 2022 00:28:43 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57660 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238447AbiBWF1T (ORCPT ); Wed, 23 Feb 2022 00:27:19 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0AF6F6E286 for ; Tue, 22 Feb 2022 21:25:29 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-2d306e372e5so163267937b3.5 for ; Tue, 22 Feb 2022 21:25:29 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=1Ak925Pk776XAKxLwtqmq+2ujMHKkPdWGXyWe6mB81I=; b=qruJowqTdMVz7fZkxeG4xVb20iJ53LrE4dT1CKYoVmqpJaSFmZqssZFPvLDlocajG+ 
Date: Tue, 22 Feb 2022 21:22:19 -0800
In-Reply-To: <20220223052223.1202152-1-junaids@google.com>
Message-Id: <20220223052223.1202152-44-junaids@google.com>
References: <20220223052223.1202152-1-junaids@google.com>
Subject: [RFC PATCH 43/47] mm: asi: Annotation of dynamic variables to be nonsensitive
From: Junaid Shahid
To: linux-kernel@vger.kernel.org
Cc: Ofir Weisse, kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org

From: Ofir Weisse

The heart of ASI is to differentiate between sensitive and non-sensitive data accesses. This commit marks certain dynamic allocations as not sensitive. Some dynamically allocated variables are accessed frequently and would therefore cause many ASI exits. The frequency of these accesses was measured by tracing asi_exits and analyzing the accessed addresses. Many of these variables don't contain sensitive information and can therefore be mapped into the global ASI region. This commit adds GFP_LOCAL/GLOBAL_NONSENSITIVE attributes to the allocations of these frequently accessed yet not sensitive variables. The end result is a very significant reduction in ASI exits on real benchmarks.
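As a reading aid, the fragment below is a minimal, illustrative example of the conversion pattern this patch applies at each allocation site; it is not an additional hunk. The struct and function names are made up, and the semantics of the two flags (introduced earlier in this series) are paraphrased from their names: __GFP_GLOBAL_NONSENSITIVE for data mapped in all restricted address spaces, __GFP_LOCAL_NONSENSITIVE for data that only needs to be mapped in the owning process's restricted address space.

#include <linux/slab.h>

struct example_ctx {
	unsigned long hot_counter;	/* touched on hot paths, holds no secrets */
};

static struct example_ctx *example_ctx_alloc(void)
{
	/*
	 * Before: kzalloc(sizeof(*ctx), GFP_KERNEL);
	 * After:  also pass __GFP_GLOBAL_NONSENSITIVE so the object stays
	 * mapped while running in the restricted address space and does
	 * not force an ASI exit on every access.
	 */
	return kzalloc(sizeof(struct example_ctx),
		       GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
}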
Signed-off-by: Ofir Weisse --- arch/x86/include/asm/kvm_host.h | 3 ++- arch/x86/kernel/apic/x2apic_cluster.c | 2 +- arch/x86/kvm/cpuid.c | 4 ++- arch/x86/kvm/lapic.c | 9 ++++--- arch/x86/kvm/mmu/mmu.c | 7 ++++++ arch/x86/kvm/vmx/vmx.c | 6 +++-- arch/x86/kvm/x86.c | 8 +++--- fs/binfmt_elf.c | 2 +- fs/eventfd.c | 2 +- fs/eventpoll.c | 10 +++++--- fs/exec.c | 2 ++ fs/file.c | 3 ++- fs/timerfd.c | 2 +- include/linux/kvm_host.h | 2 +- include/linux/kvm_types.h | 3 +++ kernel/cgroup/cgroup.c | 4 +-- kernel/events/core.c | 15 +++++++---- kernel/exit.c | 2 ++ kernel/fork.c | 36 +++++++++++++++++++++------ kernel/rcu/srcutree.c | 3 ++- kernel/sched/core.c | 6 +++-- kernel/sched/cpuacct.c | 8 +++--- kernel/sched/fair.c | 3 ++- kernel/sched/topology.c | 14 +++++++---- kernel/smp.c | 17 +++++++------ kernel/trace/ring_buffer.c | 5 ++-- kernel/tracepoint.c | 2 +- lib/radix-tree.c | 6 ++--- mm/memcontrol.c | 7 +++--- mm/util.c | 3 ++- mm/vmalloc.c | 3 ++- net/core/skbuff.c | 2 +- net/core/sock.c | 2 +- virt/kvm/coalesced_mmio.c | 2 +- virt/kvm/eventfd.c | 5 ++-- virt/kvm/kvm_main.c | 12 ++++++--- 36 files changed, 148 insertions(+), 74 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index b7292c4fece7..34a05add5e77 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1562,7 +1562,8 @@ static inline void kvm_ops_static_call_update(void) #define __KVM_HAVE_ARCH_VM_ALLOC static inline struct kvm *kvm_arch_alloc_vm(void) { - return __vmalloc(kvm_x86_ops.vm_size, GFP_KERNEL_ACCOUNT | __GFP_ZERO); + return __vmalloc(kvm_x86_ops.vm_size, GFP_KERNEL_ACCOUNT | __GFP_ZERO | + __GFP_GLOBAL_NONSENSITIVE); } #define __KVM_HAVE_ARCH_VM_FREE diff --git a/arch/x86/kernel/apic/x2apic_cluster.c b/arch/x86/kernel/apic/x2apic_cluster.c index 655fe820a240..a1f6eb51ecb7 100644 --- a/arch/x86/kernel/apic/x2apic_cluster.c +++ b/arch/x86/kernel/apic/x2apic_cluster.c @@ -144,7 +144,7 @@ static int alloc_clustermask(unsigned int cpu, int node) } cluster_hotplug_mask = kzalloc_node(sizeof(*cluster_hotplug_mask), - GFP_KERNEL, node); + GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE, node); if (!cluster_hotplug_mask) return -ENOMEM; cluster_hotplug_mask->node = node; diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 07e9215e911d..dedabfdd292e 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -310,7 +310,9 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, if (IS_ERR(e)) return PTR_ERR(e); - e2 = kvmalloc_array(cpuid->nent, sizeof(*e2), GFP_KERNEL_ACCOUNT); + e2 = kvmalloc_array(cpuid->nent, sizeof(*e2), + GFP_KERNEL_ACCOUNT | + __GFP_LOCAL_NONSENSITIVE); if (!e2) { r = -ENOMEM; goto out_free_cpuid; diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 213bbdfab49e..3a550299f015 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -213,7 +213,7 @@ void kvm_recalculate_apic_map(struct kvm *kvm) new = kvzalloc(sizeof(struct kvm_apic_map) + sizeof(struct kvm_lapic *) * ((u64)max_id + 1), - GFP_KERNEL_ACCOUNT); + GFP_KERNEL_ACCOUNT | __GFP_LOCAL_NONSENSITIVE); if (!new) goto out; @@ -993,7 +993,7 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src, *r = -1; if (irq->shorthand == APIC_DEST_SELF) { - *r = kvm_apic_set_irq(src->vcpu, irq, dest_map); + *r = kvm_apic_set_irq(src->vcpu, irq, dest_map); return true; } @@ -2455,13 +2455,14 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns) ASSERT(vcpu != NULL); - apic = kzalloc(sizeof(*apic), GFP_KERNEL_ACCOUNT); + apic = 
kzalloc(sizeof(*apic), GFP_KERNEL_ACCOUNT | __GFP_LOCAL_NONSENSITIVE); if (!apic) goto nomem; vcpu->arch.apic = apic; - apic->regs = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT); + apic->regs = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT + | __GFP_LOCAL_NONSENSITIVE); if (!apic->regs) { printk(KERN_ERR "malloc apic regs error for vcpu %x\n", vcpu->vcpu_id); diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 5785a0d02558..a2ada1104c2d 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5630,6 +5630,13 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu) vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO; vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + if (static_cpu_has(X86_FEATURE_ASI) && mm_asi_enabled(current->mm)) + vcpu->arch.mmu_shadow_page_cache.gfp_asi = + __GFP_LOCAL_NONSENSITIVE; + else + vcpu->arch.mmu_shadow_page_cache.gfp_asi = 0; +#endif vcpu->arch.mmu = &vcpu->arch.root_mmu; vcpu->arch.walk_mmu = &vcpu->arch.root_mmu; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index e1ad82c25a78..6e1bb017b696 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2629,7 +2629,7 @@ void free_loaded_vmcs(struct loaded_vmcs *loaded_vmcs) free_vmcs(loaded_vmcs->vmcs); loaded_vmcs->vmcs = NULL; if (loaded_vmcs->msr_bitmap) - free_page((unsigned long)loaded_vmcs->msr_bitmap); + kfree(loaded_vmcs->msr_bitmap); WARN_ON(loaded_vmcs->shadow_vmcs != NULL); } @@ -2648,7 +2648,9 @@ int alloc_loaded_vmcs(struct loaded_vmcs *loaded_vmcs) if (cpu_has_vmx_msr_bitmap()) { loaded_vmcs->msr_bitmap = (unsigned long *) - __get_free_page(GFP_KERNEL_ACCOUNT); + kzalloc(PAGE_SIZE, + GFP_KERNEL_ACCOUNT | + __GFP_LOCAL_NONSENSITIVE ); if (!loaded_vmcs->msr_bitmap) goto out_vmcs; memset(loaded_vmcs->msr_bitmap, 0xff, PAGE_SIZE); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 451872d178e5..dd862edc1b5a 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -329,7 +329,8 @@ static struct kmem_cache *kvm_alloc_emulator_cache(void) return kmem_cache_create_usercopy("x86_emulator", size, __alignof__(struct x86_emulate_ctxt), - SLAB_ACCOUNT, useroffset, + SLAB_ACCOUNT|SLAB_LOCAL_NONSENSITIVE, + useroffset, size - useroffset, NULL); } @@ -10969,7 +10970,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) r = -ENOMEM; - page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO); + page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_LOCAL_NONSENSITIVE); if (!page) goto fail_free_lapic; vcpu->arch.pio_data = page_address(page); @@ -11718,7 +11719,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm, lpages = __kvm_mmu_slot_lpages(slot, npages, level); - linfo = kvcalloc(lpages, sizeof(*linfo), GFP_KERNEL_ACCOUNT); + linfo = kvcalloc(lpages, sizeof(*linfo), + GFP_KERNEL_ACCOUNT | __GFP_LOCAL_NONSENSITIVE); if (!linfo) goto out_free; diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index f8c7f26f1fbb..b0550951da59 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -477,7 +477,7 @@ static struct elf_phdr *load_elf_phdrs(const struct elfhdr *elf_ex, if (size == 0 || size > 65536 || size > ELF_MIN_ALIGN) goto out; - elf_phdata = kmalloc(size, GFP_KERNEL); + elf_phdata = kmalloc(size, GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE); if (!elf_phdata) goto out; diff --git a/fs/eventfd.c b/fs/eventfd.c index 3627dd7d25db..c748433e52af 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -415,7 +415,7 @@ static int do_eventfd(unsigned int count, int flags) if (flags & ~EFD_FLAGS_SET) return -EINVAL; - ctx = 
kmalloc(sizeof(*ctx), GFP_KERNEL); + ctx = kmalloc(sizeof(*ctx), GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE); if (!ctx) return -ENOMEM; diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 06f4c5ae1451..b28826c9f079 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -1239,7 +1239,7 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, if (unlikely(!epi)) // an earlier allocation has failed return; - pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL); + pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE); if (unlikely(!pwq)) { epq->epi = NULL; return; @@ -1453,7 +1453,8 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event, return -ENOSPC; percpu_counter_inc(&ep->user->epoll_watches); - if (!(epi = kmem_cache_zalloc(epi_cache, GFP_KERNEL))) { + if (!(epi = kmem_cache_zalloc(epi_cache, + GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE))) { percpu_counter_dec(&ep->user->epoll_watches); return -ENOMEM; } @@ -2373,11 +2374,12 @@ static int __init eventpoll_init(void) /* Allocates slab cache used to allocate "struct epitem" items */ epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem), - 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL); + 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT|SLAB_GLOBAL_NONSENSITIVE, NULL); /* Allocates slab cache used to allocate "struct eppoll_entry" */ pwq_cache = kmem_cache_create("eventpoll_pwq", - sizeof(struct eppoll_entry), 0, SLAB_PANIC|SLAB_ACCOUNT, NULL); + sizeof(struct eppoll_entry), 0, + SLAB_PANIC|SLAB_ACCOUNT|SLAB_GLOBAL_NONSENSITIVE, NULL); ephead_cache = kmem_cache_create("ep_head", sizeof(struct epitems_head), 0, SLAB_PANIC|SLAB_ACCOUNT, NULL); diff --git a/fs/exec.c b/fs/exec.c index 537d92c41105..76f3b433e80d 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1238,6 +1238,8 @@ int begin_new_exec(struct linux_binprm * bprm) struct task_struct *me = current; int retval; + /* TODO: (oweisse) unmap the stack from ASI */ + /* Once we are committed compute the creds */ retval = bprm_creds_from_file(bprm); if (retval) diff --git a/fs/file.c b/fs/file.c index 97d212a9b814..85bfa5d70323 100644 --- a/fs/file.c +++ b/fs/file.c @@ -117,7 +117,8 @@ static struct fdtable * alloc_fdtable(unsigned int nr) if (!fdt) goto out; fdt->max_fds = nr; - data = kvmalloc_array(nr, sizeof(struct file *), GFP_KERNEL_ACCOUNT); + data = kvmalloc_array(nr, sizeof(struct file *), + GFP_KERNEL_ACCOUNT | __GFP_LOCAL_NONSENSITIVE); if (!data) goto out_fdt; fdt->fd = data; diff --git a/fs/timerfd.c b/fs/timerfd.c index e9c96a0c79f1..385fbb29837d 100644 --- a/fs/timerfd.c +++ b/fs/timerfd.c @@ -425,7 +425,7 @@ SYSCALL_DEFINE2(timerfd_create, int, clockid, int, flags) !capable(CAP_WAKE_ALARM)) return -EPERM; - ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE); if (!ctx) return -ENOMEM; diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index f31f7442eced..dfbb26d7a185 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1085,7 +1085,7 @@ int kvm_arch_create_vm_debugfs(struct kvm *kvm); */ static inline struct kvm *kvm_arch_alloc_vm(void) { - return kzalloc(sizeof(struct kvm), GFP_KERNEL); + return kzalloc(sizeof(struct kvm), GFP_KERNEL | __GFP_LOCAL_NONSENSITIVE); } #endif diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h index 234eab059839..a5a810db85ca 100644 --- a/include/linux/kvm_types.h +++ b/include/linux/kvm_types.h @@ -64,6 +64,9 @@ struct gfn_to_hva_cache { struct kvm_mmu_memory_cache { int nobjs; 
gfp_t gfp_zero; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + gfp_t gfp_asi; +#endif struct kmem_cache *kmem_cache; void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE]; }; diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 729495e17363..79692dafd2be 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1221,7 +1221,7 @@ static struct css_set *find_css_set(struct css_set *old_cset, if (cset) return cset; - cset = kzalloc(sizeof(*cset), GFP_KERNEL); + cset = kzalloc(sizeof(*cset), GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE); if (!cset) return NULL; @@ -5348,7 +5348,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name, /* allocate the cgroup and its ID, 0 is reserved for the root */ cgrp = kzalloc(struct_size(cgrp, ancestor_ids, (level + 1)), - GFP_KERNEL); + GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE); if (!cgrp) return ERR_PTR(-ENOMEM); diff --git a/kernel/events/core.c b/kernel/events/core.c index 1914cc538cab..64eeb2c67d92 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -4586,7 +4586,8 @@ alloc_perf_context(struct pmu *pmu, struct task_struct *task) { struct perf_event_context *ctx; - ctx = kzalloc(sizeof(struct perf_event_context), GFP_KERNEL); + ctx = kzalloc(sizeof(struct perf_event_context), + GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE); if (!ctx) return NULL; @@ -11062,7 +11063,8 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type) mutex_lock(&pmus_lock); ret = -ENOMEM; - pmu->pmu_disable_count = alloc_percpu(int); + pmu->pmu_disable_count = alloc_percpu_gfp(int, + GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE); if (!pmu->pmu_disable_count) goto unlock; @@ -11112,7 +11114,8 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type) goto got_cpu_context; ret = -ENOMEM; - pmu->pmu_cpu_context = alloc_percpu(struct perf_cpu_context); + pmu->pmu_cpu_context = alloc_percpu_gfp(struct perf_cpu_context, + GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE); if (!pmu->pmu_cpu_context) goto free_dev; @@ -11493,7 +11496,8 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu, } node = (cpu >= 0) ? cpu_to_node(cpu) : -1; - event = kmem_cache_alloc_node(perf_event_cache, GFP_KERNEL | __GFP_ZERO, + event = kmem_cache_alloc_node(perf_event_cache, + GFP_KERNEL | __GFP_ZERO | __GFP_GLOBAL_NONSENSITIVE, node); if (!event) return ERR_PTR(-ENOMEM); @@ -13378,7 +13382,8 @@ void __init perf_event_init(void) ret = init_hw_breakpoint(); WARN(ret, "hw_breakpoint initialization failed with: %d", ret); - perf_event_cache = KMEM_CACHE(perf_event, SLAB_PANIC); + perf_event_cache = KMEM_CACHE(perf_event, + SLAB_PANIC | SLAB_GLOBAL_NONSENSITIVE); /* * Build time assertion that we keep the data_head at the intended diff --git a/kernel/exit.c b/kernel/exit.c index f702a6a63686..ab2749cf6887 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -768,6 +768,8 @@ void __noreturn do_exit(long code) profile_task_exit(tsk); kcov_task_exit(tsk); + /* TODO: (oweisse) unmap the stack from ASI */ + coredump_task_exit(tsk); ptrace_event(PTRACE_EVENT_EXIT, code); diff --git a/kernel/fork.c b/kernel/fork.c index d7f55de00947..cb147a72372d 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -168,6 +168,8 @@ static struct kmem_cache *task_struct_cachep; static inline struct task_struct *alloc_task_struct_node(int node) { + /* TODO: Figure how to allocate this propperly to ASI process map. This + * should be mapped in a __GFP_LOCAL_NONSENSITIVE slab. 
*/ return kmem_cache_alloc_node(task_struct_cachep, GFP_KERNEL, node); } @@ -214,6 +216,7 @@ static int free_vm_stack_cache(unsigned int cpu) static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node) { + /* TODO: (oweisse) Add annotation to map the stack into ASI */ #ifdef CONFIG_VMAP_STACK void *stack; int i; @@ -242,9 +245,13 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node) * so memcg accounting is performed manually on assigning/releasing * stacks to tasks. Drop __GFP_ACCOUNT. */ + /* ASI: We intentionally don't pass VM_LOCAL_NONSENSITIVE nor + * __GFP_LOCAL_NONSENSITIVE since we don't have an mm yet. Later on we'll + * map the stack into the mm asi map. That being said, we do care about + * the stack weing allocaed below VMALLOC_LOCAL_NONSENSITIVE_END */ stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN, - VMALLOC_START, VMALLOC_END, - THREADINFO_GFP & ~__GFP_ACCOUNT, + VMALLOC_START, VMALLOC_LOCAL_NONSENSITIVE_END, + (THREADINFO_GFP & (~__GFP_ACCOUNT)), PAGE_KERNEL, 0, node, __builtin_return_address(0)); @@ -346,7 +353,8 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) { struct vm_area_struct *vma; - vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); + vma = kmem_cache_alloc(vm_area_cachep, + GFP_KERNEL); if (vma) vma_init(vma, mm); return vma; @@ -683,6 +691,8 @@ static void check_mm(struct mm_struct *mm) #endif } +/* TODO: (oweisse) ASI: we need to allocate mm such that it will only be visible + * within itself. */ #define allocate_mm() (kmem_cache_alloc(mm_cachep, GFP_KERNEL)) #define free_mm(mm) (kmem_cache_free(mm_cachep, (mm))) @@ -823,9 +833,12 @@ void __init fork_init(void) /* create a slab on which task_structs can be allocated */ task_struct_whitelist(&useroffset, &usersize); + /* TODO: (oweisse) for the time being this cache is shared among all tasks. We + * mark it SLAB_NONSENSITIVE so task_struct can be accessed withing ASI. + * A final secure solution should have this memory LOCAL, not GLOBAL.*/ task_struct_cachep = kmem_cache_create_usercopy("task_struct", arch_task_struct_size, align, - SLAB_PANIC|SLAB_ACCOUNT, + SLAB_PANIC|SLAB_ACCOUNT|SLAB_GLOBAL_NONSENSITIVE, useroffset, usersize, NULL); #endif @@ -1601,6 +1614,7 @@ static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk) refcount_inc(¤t->sighand->count); return 0; } + /* TODO: (oweisse) ASI replace with proper ASI allcation. */ sig = kmem_cache_alloc(sighand_cachep, GFP_KERNEL); RCU_INIT_POINTER(tsk->sighand, sig); if (!sig) @@ -1649,6 +1663,8 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk) if (clone_flags & CLONE_THREAD) return 0; + /* TODO: (oweisse) figure out how to properly allocate this in ASI for local + * process */ sig = kmem_cache_zalloc(signal_cachep, GFP_KERNEL); tsk->signal = sig; if (!sig) @@ -2923,7 +2939,8 @@ void __init proc_caches_init(void) SLAB_ACCOUNT, sighand_ctor); signal_cachep = kmem_cache_create("signal_cache", sizeof(struct signal_struct), 0, - SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT| + SLAB_GLOBAL_NONSENSITIVE, NULL); files_cachep = kmem_cache_create("files_cache", sizeof(struct files_struct), 0, @@ -2941,13 +2958,18 @@ void __init proc_caches_init(void) */ mm_size = sizeof(struct mm_struct) + cpumask_size(); + /* TODO: (oweisse) ASI replace with proper ASI allcation. 
*/ mm_cachep = kmem_cache_create_usercopy("mm_struct", mm_size, ARCH_MIN_MMSTRUCT_ALIGN, - SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT + |SLAB_GLOBAL_NONSENSITIVE, offsetof(struct mm_struct, saved_auxv), sizeof_field(struct mm_struct, saved_auxv), NULL); - vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT); + + /* TODO: (oweisse) ASI replace with proper ASI allcation. */ + vm_area_cachep = KMEM_CACHE(vm_area_struct, + SLAB_PANIC|SLAB_ACCOUNT|SLAB_LOCAL_NONSENSITIVE); mmap_init(); nsproxy_cache_init(); } diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index 6833d8887181..553221503803 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -171,7 +171,8 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp, bool is_static) atomic_set(&ssp->srcu_barrier_cpu_cnt, 0); INIT_DELAYED_WORK(&ssp->work, process_srcu); if (!is_static) - ssp->sda = alloc_percpu(struct srcu_data); + ssp->sda = alloc_percpu_gfp(struct srcu_data, + GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE); if (!ssp->sda) return -ENOMEM; init_srcu_struct_nodes(ssp); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 7c96f0001c7f..7515f0612f5c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -9329,7 +9329,8 @@ void __init sched_init(void) #endif /* CONFIG_RT_GROUP_SCHED */ #ifdef CONFIG_CGROUP_SCHED - task_group_cache = KMEM_CACHE(task_group, 0); + /* TODO: (oweisse) add SLAB_NONSENSITIVE */ + task_group_cache = KMEM_CACHE(task_group, SLAB_GLOBAL_NONSENSITIVE); list_add(&root_task_group.list, &task_groups); INIT_LIST_HEAD(&root_task_group.children); @@ -9741,7 +9742,8 @@ struct task_group *sched_create_group(struct task_group *parent) { struct task_group *tg; - tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO); + tg = kmem_cache_alloc(task_group_cache, + GFP_KERNEL | __GFP_ZERO | __GFP_GLOBAL_NONSENSITIVE); if (!tg) return ERR_PTR(-ENOMEM); diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c index 6e3da149125c..e8b0b29b4d37 100644 --- a/kernel/sched/cpuacct.c +++ b/kernel/sched/cpuacct.c @@ -64,15 +64,17 @@ cpuacct_css_alloc(struct cgroup_subsys_state *parent_css) if (!parent_css) return &root_cpuacct.css; - ca = kzalloc(sizeof(*ca), GFP_KERNEL); + ca = kzalloc(sizeof(*ca), GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE); if (!ca) goto out; - ca->cpuusage = alloc_percpu(struct cpuacct_usage); + ca->cpuusage = alloc_percpu_gfp(struct cpuacct_usage, + GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE); if (!ca->cpuusage) goto out_free_ca; - ca->cpustat = alloc_percpu(struct kernel_cpustat); + ca->cpustat = alloc_percpu_gfp(struct kernel_cpustat, + GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE); if (!ca->cpustat) goto out_free_cpuusage; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index dc9b6133b059..97d70f1eb2c5 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -11486,7 +11486,8 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent) for_each_possible_cpu(i) { cfs_rq = kzalloc_node(sizeof(struct cfs_rq), - GFP_KERNEL, cpu_to_node(i)); + GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE, + cpu_to_node(i)); if (!cfs_rq) goto err; diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 1dcea6a6133e..2ad96c78306c 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -569,7 +569,7 @@ static struct root_domain *alloc_rootdomain(void) { struct root_domain *rd; - rd = kzalloc(sizeof(*rd), GFP_KERNEL); + rd = kzalloc(sizeof(*rd), GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE); if (!rd) 
return NULL; @@ -2044,21 +2044,24 @@ static int __sdt_alloc(const struct cpumask *cpu_map) struct sched_group_capacity *sgc; sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(), - GFP_KERNEL, cpu_to_node(j)); + GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE, + cpu_to_node(j)); if (!sd) return -ENOMEM; *per_cpu_ptr(sdd->sd, j) = sd; sds = kzalloc_node(sizeof(struct sched_domain_shared), - GFP_KERNEL, cpu_to_node(j)); + GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE, + cpu_to_node(j)); if (!sds) return -ENOMEM; *per_cpu_ptr(sdd->sds, j) = sds; sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(), - GFP_KERNEL, cpu_to_node(j)); + GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE, + cpu_to_node(j)); if (!sg) return -ENOMEM; @@ -2067,7 +2070,8 @@ static int __sdt_alloc(const struct cpumask *cpu_map) *per_cpu_ptr(sdd->sg, j) = sg; sgc = kzalloc_node(sizeof(struct sched_group_capacity) + cpumask_size(), - GFP_KERNEL, cpu_to_node(j)); + GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE, + cpu_to_node(j)); if (!sgc) return -ENOMEM; diff --git a/kernel/smp.c b/kernel/smp.c index 3c1b328f0a09..db9ab5a58e2c 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -103,15 +103,18 @@ int smpcfd_prepare_cpu(unsigned int cpu) { struct call_function_data *cfd = &per_cpu(cfd_data, cpu); - if (!zalloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL, + if (!zalloc_cpumask_var_node(&cfd->cpumask, + GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE, cpu_to_node(cpu))) return -ENOMEM; - if (!zalloc_cpumask_var_node(&cfd->cpumask_ipi, GFP_KERNEL, + if (!zalloc_cpumask_var_node(&cfd->cpumask_ipi, + GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE, cpu_to_node(cpu))) { free_cpumask_var(cfd->cpumask); return -ENOMEM; } - cfd->pcpu = alloc_percpu(struct cfd_percpu); + cfd->pcpu = alloc_percpu_gfp(struct cfd_percpu, + GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE); if (!cfd->pcpu) { free_cpumask_var(cfd->cpumask); free_cpumask_var(cfd->cpumask_ipi); @@ -179,10 +182,10 @@ static int __init csdlock_debug(char *str) } early_param("csdlock_debug", csdlock_debug); -static DEFINE_PER_CPU(call_single_data_t *, cur_csd); -static DEFINE_PER_CPU(smp_call_func_t, cur_csd_func); -static DEFINE_PER_CPU(void *, cur_csd_info); -static DEFINE_PER_CPU(struct cfd_seq_local, cfd_seq_local); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(call_single_data_t *, cur_csd); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(smp_call_func_t, cur_csd_func); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(void *, cur_csd_info); +static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct cfd_seq_local, cfd_seq_local); #define CSD_LOCK_TIMEOUT (5ULL * NSEC_PER_SEC) static atomic_t csd_bug_count = ATOMIC_INIT(0); diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 2699e9e562b1..9ad7d4569d4b 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1539,7 +1539,8 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer, * gracefully without invoking oom-killer and the system is not * destabilized. */ - mflags = GFP_KERNEL | __GFP_RETRY_MAYFAIL; + /* TODO(oweisse): this is a hack to enable ASI tracing. 
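To recap the two per-CPU conversion patterns used in the sched and smp.c hunks above, a hedged sketch (struct my_stats and my_seq are invented; the gfp flag and the DEFINE_PER_CPU variant are the ones this series adds):

/* Dynamic per-CPU data: alloc_percpu() takes no gfp argument, so the
 * conversion switches to alloc_percpu_gfp() to pass the ASI hint. */
static int example_prepare(void)
{
	struct my_stats __percpu *stats;

	stats = alloc_percpu_gfp(struct my_stats,
				 GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
	if (!stats)
		return -ENOMEM;
	return 0;
}

/* Static per-CPU data: this define presumably places the variable in the
 * __per_cpu_asi region that a later patch in this series maps into every
 * restricted address space. */
static DEFINE_PER_CPU_ASI_NOT_SENSITIVE(struct my_seq, my_seq);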
*/ + mflags = GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_GLOBAL_NONSENSITIVE; /* * If a user thread allocates too much, and si_mem_available() @@ -1718,7 +1719,7 @@ struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags, /* keep it in its own cache line */ buffer = kzalloc(ALIGN(sizeof(*buffer), cache_line_size()), - GFP_KERNEL); + GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE); if (!buffer) return NULL; diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c index 64ea283f2f86..0ae6c38ee121 100644 --- a/kernel/tracepoint.c +++ b/kernel/tracepoint.c @@ -107,7 +107,7 @@ static void tp_stub_func(void) static inline void *allocate_probes(int count) { struct tp_probes *p = kmalloc(struct_size(p, probes, count), - GFP_KERNEL); + GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE); return p == NULL ? NULL : p->probes; } diff --git a/lib/radix-tree.c b/lib/radix-tree.c index b3afafe46fff..c7d3342a7b30 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -248,8 +248,7 @@ radix_tree_node_alloc(gfp_t gfp_mask, struct radix_tree_node *parent, * cache first for the new node to get accounted to the memory * cgroup. */ - ret = kmem_cache_alloc(radix_tree_node_cachep, - gfp_mask | __GFP_NOWARN); + ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask | __GFP_NOWARN); if (ret) goto out; @@ -1597,9 +1596,10 @@ void __init radix_tree_init(void) BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32); BUILD_BUG_ON(ROOT_IS_IDR & ~GFP_ZONEMASK); BUILD_BUG_ON(XA_CHUNK_SIZE > 255); + /*TODO: (oweisse) ASI add SLAB_NONSENSITIVE */ radix_tree_node_cachep = kmem_cache_create("radix_tree_node", sizeof(struct radix_tree_node), 0, - SLAB_PANIC | SLAB_RECLAIM_ACCOUNT, + SLAB_PANIC | SLAB_RECLAIM_ACCOUNT | SLAB_GLOBAL_NONSENSITIVE, radix_tree_node_ctor); ret = cpuhp_setup_state_nocalls(CPUHP_RADIX_DEAD, "lib/radix:dead", NULL, radix_tree_cpu_dead); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index a66d6b222ecf..fbc42e96b157 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5143,20 +5143,21 @@ static struct mem_cgroup *mem_cgroup_alloc(void) size = sizeof(struct mem_cgroup); size += nr_node_ids * sizeof(struct mem_cgroup_per_node *); - memcg = kzalloc(size, GFP_KERNEL); + memcg = kzalloc(size, GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE); if (!memcg) return ERR_PTR(error); memcg->id.id = idr_alloc(&mem_cgroup_idr, NULL, 1, MEM_CGROUP_ID_MAX, - GFP_KERNEL); + GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE); if (memcg->id.id < 0) { error = memcg->id.id; goto fail; } memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu, - GFP_KERNEL_ACCOUNT); + GFP_KERNEL_ACCOUNT | + __GFP_GLOBAL_NONSENSITIVE); if (!memcg->vmstats_percpu) goto fail; diff --git a/mm/util.c b/mm/util.c index 741ba32a43ac..0a49e15a0765 100644 --- a/mm/util.c +++ b/mm/util.c @@ -196,7 +196,8 @@ void *vmemdup_user(const void __user *src, size_t len) { void *p; - p = kvmalloc(len, GFP_USER); + /* TODO(oweisse): is this secure? 
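The open question in that TODO reflects a rule of thumb that appears to run through these mm/ hunks; stated here as an editorial inference, not text from the patch:

/*
 * Apparent convention (inferred, not spelled out by the series):
 *   - kernel-wide data safe to expose to any isolated context
 *       -> gfp | __GFP_GLOBAL_NONSENSITIVE
 *   - data owned by a single mm (user copies, per-VM buffers)
 *       -> gfp | __GFP_LOCAL_NONSENSITIVE
 *   - possibly sensitive data
 *       -> no ASI flag; it stays unmapped inside restricted
 *          address spaces.
 */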
*/ + p = kvmalloc(len, GFP_USER | __GFP_LOCAL_NONSENSITIVE); if (!p) return ERR_PTR(-ENOMEM); diff --git a/mm/vmalloc.c b/mm/vmalloc.c index a89866a926f6..659560f286b0 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -3309,7 +3309,8 @@ EXPORT_SYMBOL(vzalloc); void *vmalloc_user(unsigned long size) { return __vmalloc_node_range(size, SHMLBA, VMALLOC_START, VMALLOC_END, - GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL, + GFP_KERNEL | __GFP_ZERO + | __GFP_LOCAL_NONSENSITIVE, PAGE_KERNEL, VM_USERMAP, NUMA_NO_NODE, __builtin_return_address(0)); } diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 909db87d7383..ce8c331386fb 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -404,7 +404,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, ? skbuff_fclone_cache : skbuff_head_cache; if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX)) - gfp_mask |= __GFP_MEMALLOC; + gfp_mask |= __GFP_MEMALLOC | __GFP_GLOBAL_NONSENSITIVE; /* Get the HEAD */ if ((flags & (SKB_ALLOC_FCLONE | SKB_ALLOC_NAPI)) == SKB_ALLOC_NAPI && diff --git a/net/core/sock.c b/net/core/sock.c index 41e91d0f7061..6f6e0bd5ebf1 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -2704,7 +2704,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp) /* Avoid direct reclaim but allow kswapd to wake */ pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP | __GFP_NOWARN | - __GFP_NORETRY, + __GFP_NORETRY | __GFP_GLOBAL_NONSENSITIVE, SKB_FRAG_PAGE_ORDER); if (likely(pfrag->page)) { pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER; diff --git a/virt/kvm/coalesced_mmio.c b/virt/kvm/coalesced_mmio.c index 0be80c213f7f..5b87476566c4 100644 --- a/virt/kvm/coalesced_mmio.c +++ b/virt/kvm/coalesced_mmio.c @@ -111,7 +111,7 @@ int kvm_coalesced_mmio_init(struct kvm *kvm) { struct page *page; - page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO); + page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_LOCAL_NONSENSITIVE); if (!page) return -ENOMEM; diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c index 2ad013b8bde9..40acb841135c 100644 --- a/virt/kvm/eventfd.c +++ b/virt/kvm/eventfd.c @@ -306,7 +306,8 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args) if (!kvm_arch_irqfd_allowed(kvm, args)) return -EINVAL; - irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL_ACCOUNT); + irqfd = kzalloc(sizeof(*irqfd), + GFP_KERNEL_ACCOUNT | __GFP_GLOBAL_NONSENSITIVE); if (!irqfd) return -ENOMEM; @@ -813,7 +814,7 @@ static int kvm_assign_ioeventfd_idx(struct kvm *kvm, if (IS_ERR(eventfd)) return PTR_ERR(eventfd); - p = kzalloc(sizeof(*p), GFP_KERNEL_ACCOUNT); + p = kzalloc(sizeof(*p), GFP_KERNEL_ACCOUNT | __GFP_GLOBAL_NONSENSITIVE); if (!p) { ret = -ENOMEM; goto fail; diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 8d2d76de5bd0..587a75428da8 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -370,6 +370,9 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc, gfp_t gfp_flags) { gfp_flags |= mc->gfp_zero; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + gfp_flags |= mc->gfp_asi; +#endif if (mc->kmem_cache) return kmem_cache_alloc(mc->kmem_cache, gfp_flags); @@ -863,7 +866,8 @@ static struct kvm_memslots *kvm_alloc_memslots(void) int i; struct kvm_memslots *slots; - slots = kvzalloc(sizeof(struct kvm_memslots), GFP_KERNEL_ACCOUNT); + slots = kvzalloc(sizeof(struct kvm_memslots), + GFP_KERNEL_ACCOUNT | __GFP_LOCAL_NONSENSITIVE); if (!slots) return NULL; @@ -1529,7 +1533,7 @@ static struct kvm_memslots *kvm_dup_memslots(struct kvm_memslots *old, else new_size = 
kvm_memslots_size(old->used_slots); - slots = kvzalloc(new_size, GFP_KERNEL_ACCOUNT); + slots = kvzalloc(new_size, GFP_KERNEL_ACCOUNT | __GFP_LOCAL_NONSENSITIVE); if (likely(slots)) kvm_copy_memslots(slots, old); @@ -3565,7 +3569,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id) } BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE); - page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO); + page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_LOCAL_NONSENSITIVE); if (!page) { r = -ENOMEM; goto vcpu_free; @@ -4959,7 +4963,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr, return -ENOSPC; new_bus = kmalloc(struct_size(bus, range, bus->dev_count + 1), - GFP_KERNEL_ACCOUNT); + GFP_KERNEL_ACCOUNT | __GFP_LOCAL_NONSENSITIVE); if (!new_bus) return -ENOMEM; From patchwork Wed Feb 23 05:22:21 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756448 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 754E4C433EF for ; Wed, 23 Feb 2022 05:28:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238377AbiBWF2i (ORCPT ); Wed, 23 Feb 2022 00:28:38 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57948 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238265AbiBWF1U (ORCPT ); Wed, 23 Feb 2022 00:27:20 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8053E6EB21 for ; Tue, 22 Feb 2022 21:25:33 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id a19-20020a25ca13000000b0061db44646b3so26600816ybg.2 for ; Tue, 22 Feb 2022 21:25:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=Dq9qt2AUp1awc/cLmelSzNKRNDnSx+0iG/3BD+f2Pqw=; b=bPNyVJMg6CFD9HMFXRNSQizMrYjSNrjLiLhbm2g1SK1dzDwAFw9IVrAGvhhX45jt5Q MB3m8YKlJbAP43meWzITJhoGXfxoAhkBewEF4x4QZqaowTmjzbchko9N97X/yQbRhrNA ImSjY75jArlh3P4WvJxt6CSq2eFG/pUb6dFuicj5exLngzk9JG3+oOBNZTaiEkVgP1d8 VW0rHsRZ6Lp5BNVpEKjEx5KJ5bw8GUJBEAXS5TOsXJl2K77U7oBnknsAPh1rXoSbln87 yaf/9ms1aZYCgCW1qLWuFDH6hT7XNDObPyRg/4cz2jm+Z6IkcAJZ7HSdd02kUIZheQ2Z Pwpg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=Dq9qt2AUp1awc/cLmelSzNKRNDnSx+0iG/3BD+f2Pqw=; b=RH6wFHeHl+zm7Da2PAoKxfcCJgtfwaJJqLToT5WZcZuRjTGCWFPXVHF1zK0ViUba40 e0gb4QjgBopcSBhp/eNxt2Vz1NF6bRMunbPrgEIw1ghDzr//k97k/Qaxd99JcQbkMrFj OlLk6EVyKm7J0/uiRz5Vc/RAbZ7Dh+1OP8CU9uO1FIL6hOBFsiNEd+qgPQZG3M0IkTbF xcT+NrcU8cfQcBBWzCl4+7MsOqOJ+Pqb+6q6mK+VSUtdk0b6e0LGt8VWjqcWODGycQrJ dzAgB5L+J+pX0MDKoedKGLE3POPcX2WaTkzNa9edhl99FRJaYiD+m1vSfpc09tSgFNqF 0mbw== X-Gm-Message-State: AOAM5320firs+F4XSYhsKVtKnXo3VDzGK8Jo7jxvaiRPxqliDXM+k1IC WJHUbdkT/hwrlC+5O4C4GeEE501NgBp0 X-Google-Smtp-Source: ABdhPJzaqkidQawY7QCK9X5rVK71lVLf3wgt7MJ0n5efK7otIJAGSETfWRZjLt8X/bYBHlNaQvv9LeMoFDFo X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a0d:d1c5:0:b0:2ca:287c:6b81 with SMTP id 
t188-20020a0dd1c5000000b002ca287c6b81mr28230447ywd.38.1645593926233; Tue, 22 Feb 2022 21:25:26 -0800 (PST) Date: Tue, 22 Feb 2022 21:22:21 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-46-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 45/47] mm: asi: Mapping global nonsensitive areas in asi_global_init From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: Ofir Weisse , kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Ofir Weisse There are several areas in memory which we consider non sensitive. These areas should be mapped in every ASI domain. We map there areas in asi_global_init(). We modified some of the linking scripts to ensure these areas are starting and ending on page boundaries. The areas: - _stext --> _etext - __init_begin --> __init_end - __start_rodata --> __end_rodata - __start_once --> __end_once - __start___ex_table --> __stop___ex_table - __start_asi_nonsensitive --> __end_asi_nonsensitive - __start_asi_nonsensitive_readmostly --> __end_asi_nonsensitive_readmostly - __vvar_page --> + PAGE_SIZE - APIC_BASE --> + PAGE_SIZE - phys_base --> + PAGE_SIZE - __start___tracepoints_ptrs --> __stop___tracepoints_ptrs - __start___tracepoint_str --> __stop___tracepoint_str - __per_cpu_asi_start --> __per_cpu_asi_end (percpu) - irq_stack_backing_store --> + sizeof(irq_stack_backing_store) (percpu) The pgd's of the following addresses are cloned, modeled after KPTI: - CPU_ENTRY_AREA_BASE - ESPFIX_BASE_ADDR Signed-off-by: Ofir Weisse --- arch/x86/kernel/head_64.S | 12 +++++ arch/x86/kernel/vmlinux.lds.S | 2 +- arch/x86/mm/asi.c | 82 +++++++++++++++++++++++++++++++ include/asm-generic/vmlinux.lds.h | 13 +++-- 4 files changed, 105 insertions(+), 4 deletions(-) diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S index d8b3ebd2bb85..3d3874661895 100644 --- a/arch/x86/kernel/head_64.S +++ b/arch/x86/kernel/head_64.S @@ -574,9 +574,21 @@ SYM_DATA_LOCAL(early_gdt_descr_base, .quad INIT_PER_CPU_VAR(gdt_page)) .align 16 /* This must match the first entry in level2_kernel_pgt */ + +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION +/* TODO: Find a way to mark .section for phys_base */ +/* Ideally, we want to map phys_base in .data..asi_non_sensitive. That doesn't + * seem to work properly. For now, we just make sure phys_base is in it's own + * page. */ + .align PAGE_SIZE +#endif SYM_DATA(phys_base, .quad 0x0) EXPORT_SYMBOL(phys_base) +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + .align PAGE_SIZE +#endif + #include "../../x86/xen/xen-head.S" __PAGE_ALIGNED_BSS diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S index 3d6dc12d198f..2b3668291785 100644 --- a/arch/x86/kernel/vmlinux.lds.S +++ b/arch/x86/kernel/vmlinux.lds.S @@ -148,8 +148,8 @@ SECTIONS } :text =0xcccc /* End of text section, which should occupy whole number of pages */ - _etext = .; . 
= ALIGN(PAGE_SIZE); + _etext = .; X86_ALIGN_RODATA_BEGIN RO_DATA(PAGE_SIZE) diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index 04628949e89d..7f2aa1823736 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -9,6 +9,7 @@ #include #include +#include /* struct irq_stack */ #include #include "mm_internal.h" @@ -17,6 +18,24 @@ #undef pr_fmt #define pr_fmt(fmt) "ASI: " fmt +#include +#include + +extern struct exception_table_entry __start___ex_table[]; +extern struct exception_table_entry __stop___ex_table[]; + +extern const char __start_asi_nonsensitive[], __end_asi_nonsensitive[]; +extern const char __start_asi_nonsensitive_readmostly[], + __end_asi_nonsensitive_readmostly[]; +extern const char __per_cpu_asi_start[], __per_cpu_asi_end[]; +extern const char *__start___tracepoint_str[]; +extern const char *__stop___tracepoint_str[]; +extern const char *__start___tracepoints_ptrs[]; +extern const char *__stop___tracepoints_ptrs[]; +extern const char __vvar_page[]; + +DECLARE_PER_CPU_PAGE_ALIGNED(struct irq_stack, irq_stack_backing_store); + static struct asi_class asi_class[ASI_MAX_NUM] __asi_not_sensitive; static DEFINE_SPINLOCK(asi_class_lock __asi_not_sensitive); @@ -412,6 +431,7 @@ void asi_unload_module(struct module* module) static int __init asi_global_init(void) { uint i, n; + int err = 0; if (!boot_cpu_has(X86_FEATURE_ASI)) return 0; @@ -436,6 +456,68 @@ static int __init asi_global_init(void) pcpu_map_asi_reserved_chunk(); + + /* + * TODO: We need to ensure that all the sections mapped below are + * actually page-aligned by the linker. For now, we temporarily just + * align the start/end addresses here, but that is incorrect as the + * rest of the page could potentially contain sensitive data. + */ +#define MAP_SECTION(start, end) \ + pr_err("%s:%d mapping 0x%lx --> 0x%lx", \ + __FUNCTION__, __LINE__, start, end); \ + err = asi_map(ASI_GLOBAL_NONSENSITIVE, \ + (void*)((unsigned long)(start) & PAGE_MASK),\ + PAGE_ALIGN((unsigned long)(end)) - \ + ((unsigned long)(start) & PAGE_MASK)); \ + BUG_ON(err); + +#define MAP_SECTION_PERCPU(start, size) \ + pr_err("%s:%d mapping PERCPU 0x%lx --> 0x%lx", \ + __FUNCTION__, __LINE__, start, (unsigned long)start+size); \ + err = asi_map_percpu(ASI_GLOBAL_NONSENSITIVE, \ + (void*)((unsigned long)(start) & PAGE_MASK), \ + PAGE_ALIGN((unsigned long)(size))); \ + BUG_ON(err); + + MAP_SECTION(_stext, _etext); + MAP_SECTION(__init_begin, __init_end); + MAP_SECTION(__start_rodata, __end_rodata); + MAP_SECTION(__start_once, __end_once); + MAP_SECTION(__start___ex_table, __stop___ex_table); + MAP_SECTION(__start_asi_nonsensitive, __end_asi_nonsensitive); + MAP_SECTION(__start_asi_nonsensitive_readmostly, + __end_asi_nonsensitive_readmostly); + MAP_SECTION(__vvar_page, __vvar_page + PAGE_SIZE); + MAP_SECTION(APIC_BASE, APIC_BASE + PAGE_SIZE); + MAP_SECTION(&phys_base, &phys_base + PAGE_SIZE); + + /* TODO: add a build flag to enable disable mapping only when + * instrumentation is used */ + MAP_SECTION(__start___tracepoints_ptrs, __stop___tracepoints_ptrs); + MAP_SECTION(__start___tracepoint_str, __stop___tracepoint_str); + + MAP_SECTION_PERCPU((void*)__per_cpu_asi_start, + __per_cpu_asi_end - __per_cpu_asi_start); + + MAP_SECTION_PERCPU(&irq_stack_backing_store, + sizeof(irq_stack_backing_store)); + + /* We have to map the stack canary into ASI. This is far from ideal, as + * attackers can use L1TF to steal the canary value, and then perhaps + * mount some other attack including a buffer overflow. This is a price + * we must pay to use ASI. 
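For one concrete section, MAP_SECTION() above boils down to roughly the following; an illustrative expansion (minus the pr_err), which also shows why the page-alignment TODO matters:

/* MAP_SECTION(_stext, _etext), approximately: the start is rounded down
 * and the end rounded up to page boundaries, so anything else sharing
 * those boundary pages becomes visible to every ASI domain. That is why
 * the TODO asks for the linker to page-align these sections. */
unsigned long start = (unsigned long)_stext & PAGE_MASK;
size_t len = PAGE_ALIGN((unsigned long)_etext) - start;
int err = asi_map(ASI_GLOBAL_NONSENSITIVE, (void *)start, len);

BUG_ON(err);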
+ */ + MAP_SECTION_PERCPU(&fixed_percpu_data, PAGE_SIZE); + +#define CLONE_INIT_PGD(addr) \ + asi_clone_pgd(asi_global_nonsensitive_pgd, init_mm.pgd, addr); + + CLONE_INIT_PGD(CPU_ENTRY_AREA_BASE); +#ifdef CONFIG_X86_ESPFIX64 + CLONE_INIT_PGD(ESPFIX_BASE_ADDR); +#endif + return 0; } subsys_initcall(asi_global_init) diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h index 0a931aedc285..7152ce3613f5 100644 --- a/include/asm-generic/vmlinux.lds.h +++ b/include/asm-generic/vmlinux.lds.h @@ -235,8 +235,10 @@ #define TRACE_PRINTKS() __start___trace_bprintk_fmt = .; \ KEEP(*(__trace_printk_fmt)) /* Trace_printk fmt' pointer */ \ __stop___trace_bprintk_fmt = .; -#define TRACEPOINT_STR() __start___tracepoint_str = .; \ +#define TRACEPOINT_STR() . = ALIGN(PAGE_SIZE); \ + __start___tracepoint_str = .; \ KEEP(*(__tracepoint_str)) /* Trace_printk fmt' pointer */ \ + . = ALIGN(PAGE_SIZE); \ __stop___tracepoint_str = .; #else #define TRACE_PRINTKS() @@ -348,8 +350,10 @@ MEM_KEEP(init.data*) \ MEM_KEEP(exit.data*) \ *(.data.unlikely) \ + . = ALIGN(PAGE_SIZE); \ __start_once = .; \ *(.data.once) \ + . = ALIGN(PAGE_SIZE); \ __end_once = .; \ STRUCT_ALIGN(); \ *(__tracepoints) \ @@ -453,9 +457,10 @@ *(.rodata) *(.rodata.*) \ SCHED_DATA \ RO_AFTER_INIT_DATA /* Read only after init */ \ - . = ALIGN(8); \ + . = ALIGN(PAGE_SIZE); \ __start___tracepoints_ptrs = .; \ KEEP(*(__tracepoints_ptrs)) /* Tracepoints: pointer array */ \ + . = ALIGN(PAGE_SIZE); \ __stop___tracepoints_ptrs = .; \ *(__tracepoints_strings)/* Tracepoints: strings */ \ } \ @@ -671,11 +676,13 @@ */ #define EXCEPTION_TABLE(align) \ . = ALIGN(align); \ + . = ALIGN(PAGE_SIZE); \ __ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) { \ __start___ex_table = .; \ KEEP(*(__ex_table)) \ + . 
= ALIGN(PAGE_SIZE); \ __stop___ex_table = .; \ - } + } \ /* * .BTF From patchwork Wed Feb 23 05:22:22 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756451 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5720CC433EF for ; Wed, 23 Feb 2022 05:29:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237231AbiBWF3y (ORCPT ); Wed, 23 Feb 2022 00:29:54 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57958 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238471AbiBWF1U (ORCPT ); Wed, 23 Feb 2022 00:27:20 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 439246F484 for ; Tue, 22 Feb 2022 21:25:34 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-2d07ae11464so162269707b3.14 for ; Tue, 22 Feb 2022 21:25:34 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=cWpgk2PD8ywov80MgvYy738g6oDVkPJMa4mg6Kd57WI=; b=EONmeIZhbgQ2Ix8S9q3J8G/HTiRJQe/uAmut/mIfTv5CIy/cqk4x/lLfaofIL6ZsPO 2DK7UMkGosmFYOPK8XbKYJ2B7I4QKSD7X8xoOad3cTZw5YniQxiy63dmB34fBM3ScXK9 b4M3PhJD3t0ZCy25IHGFNrU1Wai96D7oCn70Yeg5Az606EQRIrVrKV03yc49SjAoBqEY D/bmDYBLa30k0Q0T1/0E+zVvHZ7GWXBtnOemyeM6dCZ+R2nj996pZd5wNPpUZxVxnqxJ WaqOqmkCjUVkdvW9sUwO7Bbj6HXZvuS7EIm1omaeZny3PaSOUoJHhQmMblrL8LNsc6wd q1rg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=cWpgk2PD8ywov80MgvYy738g6oDVkPJMa4mg6Kd57WI=; b=ZPr686fB9C16mvBE2Lih674/lw2/G0f3XAI4AE/fqpJUui+0ZEfnByn7KmSa+YaeOj md6rGSPQMenNAi2m8W4H5fUaH47yUWUJv3bw08/sWapbxJ7ejLGhhHQ0WBl4PGfVaMlA bt+iRV83eDS0oL8X7mV4l5iDOQH9V8r7uGAn4N7COLys7QEBLD3Hjd/JJ0PTNUuVgwh8 DXG4CxWuA3ZKNaU3Y/42xgbYW7mPHQUELNZIy9XMeMKASbB247Nnoyt3xMKtZkyJUPpT zvGGk6RuM0eLCJ8Ho/Q0hUxEMtP2U/XV9ZB+GkyyohAVzYuwX3lqnJ+F0Kc7fGzpcAgS Ufzg== X-Gm-Message-State: AOAM533pkdzJrkICaCqeDRMnfbmorLIukyMNmUu3dNEM9n0VUdH05WyN qxDIf+1z0Uh0Oy/9jGs0tbGLWMTtKHm5 X-Google-Smtp-Source: ABdhPJxkXSiWBnVLfkr470oqcQZfrXanhJKcgx/q87EPE8RXSdLDTEk6wPn7sTX4KYfxfNXBGCkwiM0xCmzr X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) (user=junaids job=sendgmr) by 2002:a25:bf87:0:b0:622:1e66:e7fd with SMTP id l7-20020a25bf87000000b006221e66e7fdmr25540509ybk.341.1645593928498; Tue, 22 Feb 2022 21:25:28 -0800 (PST) Date: Tue, 22 Feb 2022 21:22:22 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-47-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 46/47] kvm: asi: Do asi_exit() in vcpu_run loop before returning to userspace From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: Ofir Weisse , kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org 
Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Ofir Weisse For the time being, we switch to the full kernel address space before returning back to userspace. Once KPTI is also implemented using ASI, we could potentially also switch to the KPTI address space directly. Signed-off-by: Ofir Weisse --- arch/x86/kvm/x86.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 680725089a18..294f73e9e71e 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10148,13 +10148,17 @@ static int vcpu_run(struct kvm_vcpu *vcpu) srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx); r = xfer_to_guest_mode_handle_work(vcpu); if (r) - return r; + goto exit; vcpu->srcu_idx = srcu_read_lock(&kvm->srcu); } } srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx); +exit: + /* TODO(oweisse): trace this exit if we're still within an ASI. */ + asi_exit(); + return r; } From patchwork Wed Feb 23 05:22:23 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Junaid Shahid X-Patchwork-Id: 12756452 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 03D75C433EF for ; Wed, 23 Feb 2022 05:29:33 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238188AbiBWF35 (ORCPT ); Wed, 23 Feb 2022 00:29:57 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58080 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237608AbiBWF1h (ORCPT ); Wed, 23 Feb 2022 00:27:37 -0500 Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0F3286FA32 for ; Tue, 22 Feb 2022 21:25:37 -0800 (PST) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-2d6baed6aafso144154937b3.3 for ; Tue, 22 Feb 2022 21:25:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=TaqdfVv20HSlc0A1c0PRFCOb1G6wQiL20chjexIyKqk=; b=rGs+4FxaDjpCpPLEg7H9QRppi6us6PQKtKoDaRWUcIklSnWA7hAYHi0ajBlLinkdGD RR60+N9t//ZgngiTNXXIeQGAWtu9hbb85MnwRsz6HZkNODh5q+Hvo26B+KyLDLD0wO72 Km+o4S/5IA7Nm5fHy35QBZ6gA5Jj2sdkDbtwen0aMy6pdcJNCDB87jD1BH+xf2J+Qqo1 uywNmgZooXGbPVFc7sIEcTr3h+ZPEiF9YjXU/CTb1IwHcqN0Ljdos6xBZ49r2iJJ80jO IU+HktHwWbxx7NyomQ4O+OP+COIay05VRI+prWu+WSkdgnQPlZ0X+27JswkXnsvP2ZOJ /p0g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=TaqdfVv20HSlc0A1c0PRFCOb1G6wQiL20chjexIyKqk=; b=bTv0142fVS7mDRXD0lPzTPJCBQkNSXaRIIJMeUW+3MOAszArhgruInqt2Ri76CrPPI 5O9wMYMX5Yu5Zuy5g35n5hOQt48fOwGg5VJpujOAUXG1wGY7VEMPXdkU0q2ZmvhGMFaK rsHOnGLDy9yB1U/G5GKw5qyItrYm6ncyaux8jIED6DyqA//4kLuw9q6Lol+iYhd0zSgA 14ViBDHIx/wZMyBk5tvIqB8S5UtY+hWen2mYhuSCexBho6nn+suqFbGh+OJLrOJain38 iuypmQK0Sbqwqoghxha7J0BT/J9bfmGChenXvWBuhKDZziEgyY5MhiPiaPwjBQbhIs0J irNg== X-Gm-Message-State: AOAM532GVFCERK023aHoYlmiGpfbFK5vz5Z62y+fedxF8TJ07iOaAyBz nAHtVlkhxb6zi68WivCtsT1hKLgU3Ybn X-Google-Smtp-Source: ABdhPJwmqi34TRvmt135zAgWCIHRaV4KjaCBBqxLVhnQGHlMtw+LypDzffAe2O8Zb/1XiXdFw/2ftIFsbgRk X-Received: from js-desktop.svl.corp.google.com ([2620:15c:2cd:202:ccbe:5d15:e2e6:322]) 
(user=junaids job=sendgmr) by 2002:a25:6fc1:0:b0:624:43a0:c16c with SMTP id k184-20020a256fc1000000b0062443a0c16cmr21683087ybc.222.1645593930719; Tue, 22 Feb 2022 21:25:30 -0800 (PST) Date: Tue, 22 Feb 2022 21:22:23 -0800 In-Reply-To: <20220223052223.1202152-1-junaids@google.com> Message-Id: <20220223052223.1202152-48-junaids@google.com> Mime-Version: 1.0 References: <20220223052223.1202152-1-junaids@google.com> X-Mailer: git-send-email 2.35.1.473.g83b2b277ed-goog Subject: [RFC PATCH 47/47] mm: asi: Properly un/mapping task stack from ASI + tlb flush From: Junaid Shahid To: linux-kernel@vger.kernel.org Cc: Ofir Weisse , kvm@vger.kernel.org, pbonzini@redhat.com, jmattson@google.com, pjt@google.com, alexandre.chartre@oracle.com, rppt@linux.ibm.com, dave.hansen@linux.intel.com, peterz@infradead.org, tglx@linutronix.de, luto@kernel.org, linux-mm@kvack.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Ofir Weisse There are several locations where this is important. Especially since a task_struct might be reused, potentially with a different mm. 1. Map in vcpu_run() @ arch/x86/kvm/x86.c 1. Unmap in release_task_stack() @ kernel/fork.c 2. Unmap in do_exit() @ kernel/exit.c 3. Unmap in begin_new_exec() @ fs/exec.c Signed-off-by: Ofir Weisse --- arch/x86/include/asm/asi.h | 6 ++++ arch/x86/kvm/x86.c | 6 ++++ arch/x86/mm/asi.c | 59 ++++++++++++++++++++++++++++++++++++++ fs/exec.c | 7 ++++- include/asm-generic/asi.h | 16 +++++++++-- include/linux/sched.h | 5 ++++ kernel/exit.c | 2 +- kernel/fork.c | 22 +++++++++++++- 8 files changed, 118 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index 6148e65fb0c2..9d8f43981678 100644 --- a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -87,6 +87,12 @@ void asi_unmap_user(struct asi *asi, void *va, size_t len); int asi_fill_pgtbl_pool(struct asi_pgtbl_pool *pool, uint count, gfp_t flags); void asi_clear_pgtbl_pool(struct asi_pgtbl_pool *pool); +int asi_map_task_stack(struct task_struct *tsk, struct asi *asi); +void asi_unmap_task_stack(struct task_struct *tsk); +void asi_mark_pages_local_nonsensitive(struct page *pages, uint order, + struct mm_struct *mm); +void asi_clear_pages_local_nonsensitive(struct page *pages, uint order); + static inline void asi_init_pgtbl_pool(struct asi_pgtbl_pool *pool) { pool->pgtbl_list = NULL; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 294f73e9e71e..718104eefaed 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10122,6 +10122,12 @@ static int vcpu_run(struct kvm_vcpu *vcpu) vcpu->srcu_idx = srcu_read_lock(&kvm->srcu); vcpu->arch.l1tf_flush_l1d = true; + /* We must have current->stack mapped into asi. This function can be + * safely called many times, as it will only do the actual mapping once. */ + r = asi_map_task_stack(current, vcpu->kvm->asi); + if (r != 0) + return r; + for (;;) { if (kvm_vcpu_running(vcpu)) { r = vcpu_enter_guest(vcpu); diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index 7f2aa1823736..a86ac6644a57 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -1029,6 +1029,45 @@ void asi_unmap(struct asi *asi, void *addr, size_t len, bool flush_tlb) asi_flush_tlb_range(asi, addr, len); } +int asi_map_task_stack(struct task_struct *tsk, struct asi *asi) +{ + int ret; + + /* If the stack is already mapped to asi - don't need to map it again. 
*/ + if (tsk->asi_stack_mapped) + return 0; + + if (!tsk->mm) + return -EINVAL; + + /* If the stack was allocated via the page allocator, we assume the + * stack pages were marked with PageNonSensitive, therefore tsk->stack + * address is properly aliased. */ + ret = asi_map(ASI_LOCAL_NONSENSITIVE, tsk->stack, THREAD_SIZE); + if (!ret) { + tsk->asi_stack_mapped = asi; + asi_sync_mapping(asi, tsk->stack, THREAD_SIZE); + } + + return ret; +} + +void asi_unmap_task_stack(struct task_struct *tsk) +{ + /* No need to unmap if the stack was not mapped to begin with. */ + if (!tsk->asi_stack_mapped) + return; + + if (!tsk->mm) + return; + + asi_unmap(ASI_LOCAL_NONSENSITIVE, tsk->stack, THREAD_SIZE, + /* flush_tlb = */ true); + + tsk->asi_stack_mapped = NULL; +} + + void *asi_va(unsigned long pa) { struct page *page = pfn_to_page(PHYS_PFN(pa)); @@ -1336,3 +1375,23 @@ void asi_unmap_user(struct asi *asi, void *addr, size_t len) } } EXPORT_SYMBOL_GPL(asi_unmap_user); + +void asi_mark_pages_local_nonsensitive(struct page *pages, uint order, + struct mm_struct *mm) +{ + uint i; + for (i = 0; i < (1 << order); i++) { + __SetPageLocalNonSensitive(pages + i); + pages[i].asi_mm = mm; + } +} + +void asi_clear_pages_local_nonsensitive(struct page *pages, uint order) +{ + uint i; + for (i = 0; i < (1 << order); i++) { + __ClearPageLocalNonSensitive(pages + i); + pages[i].asi_mm = NULL; + } +} + diff --git a/fs/exec.c b/fs/exec.c index 76f3b433e80d..fb9182cf3f33 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -69,6 +69,7 @@ #include #include #include +#include #include #include "internal.h" @@ -1238,7 +1239,11 @@ int begin_new_exec(struct linux_binprm * bprm) struct task_struct *me = current; int retval; - /* TODO: (oweisse) unmap the stack from ASI */ + /* The old mm is about to be released later on in exec_mmap. We are + * reusing the task, including its stack which was mapped to + * mm->asi_pgd[0]. We need to asi_unmap the stack, so the destructor of + * the mm won't complain on "lingering" asi mappings. */ + asi_unmap_task_stack(current); /* Once we are committed compute the creds */ retval = bprm_creds_from_file(bprm); diff --git a/include/asm-generic/asi.h b/include/asm-generic/asi.h index 2763cb1a974c..6e9a261a2b9d 100644 --- a/include/asm-generic/asi.h +++ b/include/asm-generic/asi.h @@ -66,8 +66,13 @@ static inline struct asi *asi_get_target(void) { return NULL; } static inline struct asi *asi_get_current(void) { return NULL; } -static inline -int asi_map_gfp(struct asi *asi, void *addr, size_t len, gfp_t gfp_flags) +static inline int asi_map_task_stack(struct task_struct *tsk, struct asi *asi) +{ return 0; } + +static inline void asi_unmap_task_stack(struct task_struct *tsk) { } + +static inline int asi_map_gfp(struct asi *asi, void *addr, size_t len, + gfp_t gfp_flags) { return 0; } @@ -130,6 +135,13 @@ static inline int asi_load_module(struct module* module) {return 0;} static inline void asi_unload_module(struct module* module) { } +static inline +void asi_mark_pages_local_nonsensitive(struct page *pages, uint order, + struct mm_struct *mm) { } + +static inline +void asi_clear_pages_local_nonsensitive(struct page *pages, uint order) { } + #endif /* !_ASSEMBLY_ */ #endif /* !CONFIG_ADDRESS_SPACE_ISOLATION */ diff --git a/include/linux/sched.h b/include/linux/sched.h index 78c351e35fec..87ad45e52b19 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -67,6 +67,7 @@ struct sighand_struct; struct signal_struct; struct task_delay_info; struct task_group; +struct asi; /* * Task state bitmask. 
NOTE! These bits are also @@ -1470,6 +1471,10 @@ struct task_struct { int mce_count; #endif +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + struct asi *asi_stack_mapped; +#endif + #ifdef CONFIG_KRETPROBES struct llist_head kretprobe_instances; #endif diff --git a/kernel/exit.c b/kernel/exit.c index ab2749cf6887..f21cc21814d1 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -768,7 +768,7 @@ void __noreturn do_exit(long code) profile_task_exit(tsk); kcov_task_exit(tsk); - /* TODO: (oweisse) unmap the stack from ASI */ + asi_unmap_task_stack(tsk); coredump_task_exit(tsk); ptrace_event(PTRACE_EVENT_EXIT, code); diff --git a/kernel/fork.c b/kernel/fork.c index cb147a72372d..876fefc477cb 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -216,7 +216,6 @@ static int free_vm_stack_cache(unsigned int cpu) static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node) { - /* TODO: (oweisse) Add annotation to map the stack into ASI */ #ifdef CONFIG_VMAP_STACK void *stack; int i; @@ -269,7 +268,16 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node) struct page *page = alloc_pages_node(node, THREADINFO_GFP, THREAD_SIZE_ORDER); + /* When marking pages as PageLocalNonSesitive we set the page->mm to be + * NULL. We must make sure the flag is cleared from the stack pages + * before free_pages is called. Otherwise, page->mm will be accessed + * which will reuslt in NULL reference. page_address() below will yield + * an aliased address after ASI_LOCAL_MAP, thanks to + * PageLocalNonSesitive flag. */ if (likely(page)) { + asi_mark_pages_local_nonsensitive(page, + THREAD_SIZE_ORDER, + NULL); tsk->stack = kasan_reset_tag(page_address(page)); return tsk->stack; } @@ -301,6 +309,14 @@ static inline void free_thread_stack(struct task_struct *tsk) } #endif + /* We must clear the PageNonSensitive flag before calling free_pages(). + * Otherwise page->mm (which is NULL) will be accessed, in order to + * unmap the pages from ASI. Specifically for the stack, we assume the + * pages were already unmapped from ASI before we got here, via + * asi_unmap_task_stack(). */ + asi_clear_pages_local_nonsensitive(virt_to_page(tsk->stack), + THREAD_SIZE_ORDER); + __free_pages(virt_to_page(tsk->stack), THREAD_SIZE_ORDER); } # else @@ -436,6 +452,7 @@ static void release_task_stack(struct task_struct *tsk) if (WARN_ON(READ_ONCE(tsk->__state) != TASK_DEAD)) return; /* Better to leak the stack than to free prematurely */ + asi_unmap_task_stack(tsk); account_kernel_stack(tsk, -1); free_thread_stack(tsk); tsk->stack = NULL; @@ -916,6 +933,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node) * functions again. */ tsk->stack = stack; +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION + tsk->asi_stack_mapped = NULL; +#endif #ifdef CONFIG_VMAP_STACK tsk->stack_vm_area = stack_vm_area; #endif
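Pulling this last patch together, the task-stack handling reduces to the lifecycle below; a condensed editorial sketch of the calls added above, not literal code:

/*
 * Allocate (kernel/fork.c, non-vmap path):
 *   page = alloc_pages_node(node, THREADINFO_GFP, THREAD_SIZE_ORDER);
 *   asi_mark_pages_local_nonsensitive(page, THREAD_SIZE_ORDER, NULL);
 *     marks the pages PageLocalNonSensitive so page_address() returns
 *     the ASI_LOCAL_MAP alias; there is no mm yet, hence NULL.
 *
 * Map before running a guest (arch/x86/kvm/x86.c, vcpu_run()):
 *   r = asi_map_task_stack(current, vcpu->kvm->asi);
 *     idempotent via tsk->asi_stack_mapped; maps THREAD_SIZE bytes at
 *     ASI_LOCAL_NONSENSITIVE and syncs the mapping into the vcpu's asi.
 *
 * Unmap when the task or its mm goes away (begin_new_exec(), do_exit(),
 * release_task_stack()):
 *   asi_unmap_task_stack(tsk);    unmaps and flushes the TLB
 *
 * Free (free_thread_stack()):
 *   asi_clear_pages_local_nonsensitive(virt_to_page(tsk->stack),
 *                                      THREAD_SIZE_ORDER);
 *   __free_pages(virt_to_page(tsk->stack), THREAD_SIZE_ORDER);
 *     the flag must be cleared first, otherwise freeing would touch the
 *     (NULL) mm pointer stashed in the page.
 */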