From patchwork Thu Jan 20 15:55:18 2022
X-Patchwork-Submitter: Peter Zijlstra
X-Patchwork-Id: 12718821
Message-ID: <20220120160822.666778608@infradead.org>
Date: Thu, 20 Jan 2022 16:55:18 +0100
From: Peter Zijlstra
To: mingo@redhat.com, tglx@linutronix.de, juri.lelli@redhat.com,
 vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org,
 bsegall@google.com, mgorman@suse.de, bristot@redhat.com
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-api@vger.kernel.org,
 x86@kernel.org, peterz@infradead.org, pjt@google.com, posk@google.com,
 avagin@google.com, jannh@google.com, tdelisle@uwaterloo.ca,
 mark.rutland@arm.com, posk@posk.io
Subject: [RFC][PATCH v2 1/5] mm: Avoid unmapping pinned pages
References: <20220120155517.066795336@infradead.org>

Add a guarantee for Anon pages that pin_user_page*() ensures the
user-mapping of these pages stays preserved. In order to ensure this,
all rmap users have been audited:

 vmscan:	already fails eviction due to page_maybe_dma_pinned()

 migrate:	migration will fail on pinned pages due to
		expected_page_refs() not matching, however that is
		*after* try_to_migrate() has already destroyed the
		user mapping of these pages. Add an early exit for
		this case.

 numa-balance:	as per the above, pinned pages cannot be migrated,
		however numa balancing scanning will happily PROT_NONE
		them to get usage information on these pages. Avoid
		this for pinned pages.

None of the other rmap users (damon, page-idle, mlock, ..) unmap the
page, they mostly just muck about with reference/dirty flags etc.

This same guarantee cannot be provided for Shared (file) pages due to
dirty page tracking.

Signed-off-by: Peter Zijlstra (Intel)
---
 mm/migrate.c  | 10 +++++++++-
 mm/mprotect.c |  6 ++++++
 2 files changed, 15 insertions(+), 1 deletion(-)

--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1472,7 +1472,15 @@ int migrate_pages(struct list_head *from
 			nr_subpages = thp_nr_pages(page);
 			cond_resched();
 
-			if (PageHuge(page))
+			/*
+			 * If the page has a pin then expected_page_refs() will
+			 * not match and the whole migration will fail later
+			 * anyway, fail early and preserve the mappings.
+			 */
+			if (page_maybe_dma_pinned(page))
+				rc = -EAGAIN;
+
+			else if (PageHuge(page))
 				rc = unmap_and_move_huge_page(get_new_page,
 						put_new_page, private, page,
 						pass > 2, mode, reason,
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -106,6 +106,12 @@ static unsigned long change_pte_range(st
 				continue;
 
 			/*
+			 * Can't migrate pinned pages, avoid touching them.
+			 */
+			if (page_maybe_dma_pinned(page))
+				continue;
+
+			/*
 			 * Don't mess with PTEs if page is already on the node
 			 * a single-threaded process is running on.
 			 */
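As a rough illustration of what the new guarantee buys a pin holder
(not part of the patch; the helper name is made up, only
pin_user_pages_fast(), PageAnon() and unpin_user_page() are real APIs):

	/*
	 * Illustration only: while the pin below is held, rmap users
	 * (vmscan, migration, NUMA balancing) will not unmap or
	 * PROT_NONE an Anon page, so the user mapping at @uaddr keeps
	 * resolving to the pinned page. File-backed pages get no such
	 * guarantee because of dirty page tracking.
	 */
	static int my_pin_user_anon_page(unsigned long uaddr, struct page **pagep)
	{
		if (pin_user_pages_fast(uaddr, 1, FOLL_WRITE | FOLL_LONGTERM, pagep) != 1)
			return -EFAULT;

		if (!PageAnon(*pagep)) {
			/* shared/file page: no mapping guarantee */
			unpin_user_page(*pagep);
			return -EINVAL;
		}

		return 0;	/* caller later does unpin_user_page(*pagep) */
	}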
From patchwork Thu Jan 20 15:55:19 2022
X-Patchwork-Submitter: Peter Zijlstra
X-Patchwork-Id: 12718823
Message-ID: <20220120160822.728589540@infradead.org>
Date: Thu, 20 Jan 2022 16:55:19 +0100
From: Peter Zijlstra
To: mingo@redhat.com, tglx@linutronix.de, juri.lelli@redhat.com,
 vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org,
 bsegall@google.com, mgorman@suse.de, bristot@redhat.com
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-api@vger.kernel.org,
 x86@kernel.org, peterz@infradead.org, pjt@google.com, posk@google.com,
 avagin@google.com, jannh@google.com, tdelisle@uwaterloo.ca,
 mark.rutland@arm.com, posk@posk.io
Subject: [RFC][PATCH v2 2/5] entry,x86: Create common IRQ operations for exceptions
References:
<20220120155517.066795336@infradead.org> MIME-Version: 1.0 X-Rspamd-Server: rspam01 X-Rspamd-Queue-Id: AD07380035 X-Stat-Signature: be79mxdk41tcgmukfe4goxqqj94ywcgy Authentication-Results: imf30.hostedemail.com; dkim=pass header.d=infradead.org header.s=casper.20170209 header.b=Pj+P38GZ; spf=none (imf30.hostedemail.com: domain of peterz@infradead.org has no SPF policy when checking 90.155.50.34) smtp.mailfrom=peterz@infradead.org; dmarc=none X-HE-Tag: 1642694947-182804 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: A number of exceptions can re-enable IRQs and schedule (provided the context taking the exception has IRQs enabled etc.). Currently this isn't standardized and each architecture does it slightly different. (Notably, ARM64 restores/inherits more than just the IRQ state, suggesting at least some form of arch_ hook might be appropriate on top of this). Create a single set of functions for this and convert x86 to use these. The purpose is to have a single (common) place to hook in order to cover all the exceptions that can schedule(). Signed-off-by: Peter Zijlstra (Intel) --- arch/x86/kernel/traps.c | 48 ++++++++++++++++--------------------------- arch/x86/mm/fault.c | 28 ++++++++++++------------- include/linux/entry-common.h | 24 +++++++++++++++++++++ 3 files changed, 56 insertions(+), 44 deletions(-) --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -73,18 +73,6 @@ DECLARE_BITMAP(system_vectors, NR_VECTORS); -static inline void cond_local_irq_enable(struct pt_regs *regs) -{ - if (regs->flags & X86_EFLAGS_IF) - local_irq_enable(); -} - -static inline void cond_local_irq_disable(struct pt_regs *regs) -{ - if (regs->flags & X86_EFLAGS_IF) - local_irq_disable(); -} - __always_inline int is_valid_bugaddr(unsigned long addr) { if (addr < TASK_SIZE_MAX) @@ -177,9 +165,9 @@ static void do_error_trap(struct pt_regs if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr) != NOTIFY_STOP) { - cond_local_irq_enable(regs); + irqentry_irq_enable(regs); do_trap(trapnr, signr, str, regs, error_code, sicode, addr); - cond_local_irq_disable(regs); + irqentry_irq_disable(regs); } } @@ -300,7 +288,7 @@ DEFINE_IDTENTRY_ERRORCODE(exc_alignment_ if (!user_mode(regs)) die("Split lock detected\n", regs, error_code); - local_irq_enable(); + irqentry_irq_enable(regs); if (handle_user_split_lock(regs, error_code)) goto out; @@ -309,7 +297,7 @@ DEFINE_IDTENTRY_ERRORCODE(exc_alignment_ error_code, BUS_ADRALN, NULL); out: - local_irq_disable(); + irqentry_irq_disable(regs); } #ifdef CONFIG_VMAP_STACK @@ -473,14 +461,14 @@ DEFINE_IDTENTRY(exc_bounds) if (notify_die(DIE_TRAP, "bounds", regs, 0, X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP) return; - cond_local_irq_enable(regs); + irqentry_irq_enable(regs); if (!user_mode(regs)) die("bounds", regs, 0); do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, 0, 0, NULL); - cond_local_irq_disable(regs); + irqentry_irq_disable(regs); } enum kernel_gp_hint { @@ -567,7 +555,7 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_pr unsigned long gp_addr; int ret; - cond_local_irq_enable(regs); + irqentry_irq_enable(regs); if (static_cpu_has(X86_FEATURE_UMIP)) { if (user_mode(regs) && fixup_umip_exception(regs)) @@ -638,7 +626,7 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_pr die_addr(desc, regs, error_code, gp_addr); exit: - cond_local_irq_disable(regs); + irqentry_irq_disable(regs); } static bool do_int3(struct pt_regs *regs) @@ -665,9 +653,9 @@ static void 
do_int3_user(struct pt_regs if (do_int3(regs)) return; - cond_local_irq_enable(regs); + irqentry_irq_enable(regs); do_trap(X86_TRAP_BP, SIGTRAP, "int3", regs, 0, 0, NULL); - cond_local_irq_disable(regs); + irqentry_irq_disable(regs); } DEFINE_IDTENTRY_RAW(exc_int3) @@ -1003,7 +991,7 @@ static __always_inline void exc_debug_us goto out; /* It's safe to allow irq's after DR6 has been saved */ - local_irq_enable(); + irqentry_irq_enable(regs); if (v8086_mode(regs)) { handle_vm86_trap((struct kernel_vm86_regs *)regs, 0, X86_TRAP_DB); @@ -1020,7 +1008,7 @@ static __always_inline void exc_debug_us send_sigtrap(regs, 0, get_si_code(dr6)); out_irq: - local_irq_disable(); + irqentry_irq_disable(regs); out: instrumentation_end(); irqentry_exit_to_user_mode(regs); @@ -1064,7 +1052,7 @@ static void math_error(struct pt_regs *r char *str = (trapnr == X86_TRAP_MF) ? "fpu exception" : "simd exception"; - cond_local_irq_enable(regs); + irqentry_irq_enable(regs); if (!user_mode(regs)) { if (fixup_exception(regs, trapnr, 0, 0)) @@ -1099,7 +1087,7 @@ static void math_error(struct pt_regs *r force_sig_fault(SIGFPE, si_code, (void __user *)uprobe_get_trap_addr(regs)); exit: - cond_local_irq_disable(regs); + irqentry_irq_disable(regs); } DEFINE_IDTENTRY(exc_coprocessor_error) @@ -1160,7 +1148,7 @@ static bool handle_xfd_event(struct pt_r if (WARN_ON(!user_mode(regs))) return false; - local_irq_enable(); + irqentry_irq_enable(regs); err = xfd_enable_feature(xfd_err); @@ -1173,7 +1161,7 @@ static bool handle_xfd_event(struct pt_r break; } - local_irq_disable(); + irqentry_irq_disable(regs); return true; } @@ -1188,12 +1176,12 @@ DEFINE_IDTENTRY(exc_device_not_available if (!boot_cpu_has(X86_FEATURE_FPU) && (cr0 & X86_CR0_EM)) { struct math_emu_info info = { }; - cond_local_irq_enable(regs); + irqentry_irq_enable(regs); info.regs = regs; math_emulate(&info); - cond_local_irq_disable(regs); + irqentry_irq_disable(regs); return; } #endif --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1209,6 +1209,12 @@ do_kern_addr_fault(struct pt_regs *regs, NOKPROBE_SYMBOL(do_kern_addr_fault); /* + * EFLAGS[3] is unused and ABI defined to be 0, use it to store IRQ state, + * because do_user_addr_fault() is too convoluted to track things. + */ +#define X86_EFLAGS_MISC (1UL << 3) + +/* * Handle faults in the user portion of the address space. Nothing in here * should check X86_PF_USER without a specific justification: for almost * all purposes, we should treat a normal kernel access to user memory @@ -1290,13 +1296,11 @@ void do_user_addr_fault(struct pt_regs * * User-mode registers count as a user access even for any * potential system fault or CPU buglet: */ - if (user_mode(regs)) { - local_irq_enable(); + if (user_mode(regs)) flags |= FAULT_FLAG_USER; - } else { - if (regs->flags & X86_EFLAGS_IF) - local_irq_enable(); - } + + irqentry_irq_enable(regs); + regs->flags |= X86_EFLAGS_MISC; perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); @@ -1483,14 +1487,10 @@ handle_page_fault(struct pt_regs *regs, do_kern_addr_fault(regs, error_code, address); } else { do_user_addr_fault(regs, error_code, address); - /* - * User address page fault handling might have reenabled - * interrupts. Fixing up all potential exit points of - * do_user_addr_fault() and its leaf functions is just not - * doable w/o creating an unholy mess or turning the code - * upside down. 
-	 */
-	local_irq_disable();
+	if (regs->flags & X86_EFLAGS_MISC) {
+		regs->flags &= ~X86_EFLAGS_MISC;
+		irqentry_irq_disable(regs);
+	}
 	}
 }
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -7,6 +7,7 @@
 #include
 #include
 #include
+#include
 #include
@@ -213,6 +214,29 @@ static inline void local_irq_disable_exi
 #endif
 
 /**
+ * irqentry_irq_enable - Conditionally enable IRQs from exceptions
+ *
+ * Common code for exceptions to (re)enable IRQs, typically done to allow
+ * from-user exceptions to schedule (since they run on the task stack).
+ */
+static inline void irqentry_irq_enable(struct pt_regs *regs)
+{
+	if (!regs_irqs_disabled(regs))
+		local_irq_enable();
+}
+
+/**
+ * irqentry_irq_disable - Conditionally disable IRQs from exceptions
+ *
+ * Counterpart of irqentry_irq_enable().
+ */
+static inline void irqentry_irq_disable(struct pt_regs *regs)
+{
+	if (!regs_irqs_disabled(regs))
+		local_irq_disable();
+}
+
+/**
  * arch_exit_to_user_mode_work - Architecture specific TIF work for exit
  *				 to user mode.
  * @regs: Pointer to currents pt_regs
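As a rough sketch of the intended calling convention (illustration
only, not part of the patch; exc_example is a made-up handler name,
mirroring the exc_* conversions above), an x86 exception handler that
may schedule now brackets its work like this:

	DEFINE_IDTENTRY(exc_example)
	{
		/* enable IRQs only if the interrupted context had them on */
		irqentry_irq_enable(regs);

		/* ... handle the exception; may take locks, may schedule ... */

		/* undo exactly what the enable above did */
		irqentry_irq_disable(regs);
	}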
From patchwork Thu Jan 20 15:55:20 2022
X-Patchwork-Submitter: Peter Zijlstra
X-Patchwork-Id: 12718822
Message-ID: <20220120160822.790430899@infradead.org>
Date: Thu, 20 Jan 2022 16:55:20 +0100
From: Peter Zijlstra
To: mingo@redhat.com, tglx@linutronix.de, juri.lelli@redhat.com,
 vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org,
 bsegall@google.com, mgorman@suse.de, bristot@redhat.com
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-api@vger.kernel.org,
 x86@kernel.org, peterz@infradead.org, pjt@google.com, posk@google.com,
 avagin@google.com, jannh@google.com, tdelisle@uwaterloo.ca,
 mark.rutland@arm.com, posk@posk.io
Subject: [RFC][PATCH v2 3/5] sched/umcg: add WF_CURRENT_CPU and externise ttwu
References: <20220120155517.066795336@infradead.org>

From: Peter Oskolkov

Add WF_CURRENT_CPU wake flag that advises the scheduler to move the
wakee to the current CPU. This is useful for fast on-CPU context
switching use cases such as UMCG.

In addition, make ttwu external rather than static so that the flag
can be passed to it from outside of sched/core.c.

Signed-off-by: Peter Oskolkov
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20211122211327.5931-2-posk@google.com
---
 kernel/sched/core.c  |  3 +--
 kernel/sched/fair.c  |  4 ++++
 kernel/sched/sched.h | 17 ++++++++++-------
 3 files changed, 15 insertions(+), 9 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3980,8 +3980,7 @@ bool ttwu_state_match(struct task_struct
  * Return: %true if @p->state changes (an actual wakeup was done),
  *	   %false otherwise.
  */
-static int
-try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 {
 	unsigned long flags;
 	int cpu, success = 0;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6838,6 +6838,10 @@ select_task_rq_fair(struct task_struct *
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
 
+		if ((wake_flags & WF_CURRENT_CPU) &&
+		    cpumask_test_cpu(cpu, p->cpus_ptr))
+			return cpu;
+
 		if (sched_energy_enabled()) {
 			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 			if (new_cpu >= 0)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2052,13 +2052,14 @@ static inline int task_on_rq_migrating(s
 }
 
 /* Wake flags. The first three directly map to some SD flag value */
-#define WF_EXEC		0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
-#define WF_FORK		0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
-#define WF_TTWU		0x08 /* Wakeup; maps to SD_BALANCE_WAKE */
-
-#define WF_SYNC		0x10 /* Waker goes to sleep after wakeup */
-#define WF_MIGRATED	0x20 /* Internal use, task got migrated */
-#define WF_ON_CPU	0x40 /* Wakee is on_cpu */
+#define WF_EXEC		0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
+#define WF_FORK		0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
+#define WF_TTWU		0x08 /* Wakeup; maps to SD_BALANCE_WAKE */
+
+#define WF_SYNC		0x10 /* Waker goes to sleep after wakeup */
+#define WF_MIGRATED	0x20 /* Internal use, task got migrated */
+#define WF_ON_CPU	0x40 /* Wakee is on_cpu */
+#define WF_CURRENT_CPU	0x80 /* Prefer to move the wakee to the current CPU. */
 
 #ifdef CONFIG_SMP
 static_assert(WF_EXEC == SD_BALANCE_EXEC);
@@ -3112,6 +3113,8 @@ static inline bool is_per_cpu_kthread(st
 extern void swake_up_all_locked(struct swait_queue_head *q);
 extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
 
+extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
+
 #ifdef CONFIG_PREEMPT_DYNAMIC
 extern int preempt_dynamic_mode;
 extern int sched_dynamic_mode(const char *str);
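Taken together, a sched-internal caller (anything that includes
kernel/sched/sched.h) can now request an on-CPU handoff. A minimal
sketch, assuming a made-up wrapper name; patch 5/5's umcg_wake_task()
does essentially this:

	/* Wake @tsk and hint that it should run on the waking CPU. */
	static void wake_on_current_cpu(struct task_struct *tsk)
	{
		try_to_wake_up(tsk, TASK_NORMAL, WF_CURRENT_CPU);
	}

The hint is only honoured when the current CPU is in tsk->cpus_ptr, per
the select_task_rq_fair() hunk above; otherwise placement falls back to
the normal wakeup path.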
From patchwork Thu Jan 20 15:55:21 2022
X-Patchwork-Submitter: Peter Zijlstra
X-Patchwork-Id: 12718824
Message-ID: <20220120160822.852009966@infradead.org>
Date: Thu, 20 Jan 2022 16:55:21 +0100
From: Peter Zijlstra
To: mingo@redhat.com, tglx@linutronix.de, juri.lelli@redhat.com,
 vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org,
 bsegall@google.com, mgorman@suse.de, bristot@redhat.com
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-api@vger.kernel.org,
 x86@kernel.org, peterz@infradead.org, pjt@google.com, posk@google.com,
 avagin@google.com, jannh@google.com, tdelisle@uwaterloo.ca,
 mark.rutland@arm.com, posk@posk.io, Sean Christopherson
Subject: [RFC][PATCH v2 4/5] x86/uaccess: Implement unsafe_try_cmpxchg_user()
References: <20220120155517.066795336@infradead.org>

Do try_cmpxchg() loops on userspace addresses.
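A minimal usage sketch (illustration only, not part of the patch; the
helper name is made up, but it mirrors how patch 5/5 uses the primitive
in umcg_update_state()): atomically move a user-space word from one
value to another, restarting when userspace races us and jumping to the
fault label on #PF:

	static int update_user_word(u32 __user *uptr, u32 from, u32 to)
	{
		u32 old;

		if (!user_access_begin(uptr, sizeof(*uptr)))
			return -EFAULT;

		unsafe_get_user(old, uptr, Efault);
		do {
			if (old != from)
				goto fail;	/* somebody else changed it */
		} while (!unsafe_try_cmpxchg_user(uptr, &old, to, Efault));

		user_access_end();
		return 0;

	fail:
		user_access_end();
		return -EAGAIN;
	Efault:
		user_access_end();
		return -EFAULT;
	}

On failure unsafe_try_cmpxchg_user() updates @old with the current user
value, so the loop re-validates against @from before retrying.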
Cc: Sean Christopherson Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sean Christopherson --- arch/x86/include/asm/uaccess.h | 67 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 67 insertions(+) --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -342,6 +342,24 @@ do { \ : [umem] "m" (__m(addr)) \ : : label) +#define __try_cmpxchg_user_asm(itype, ltype, _ptr, _pold, _new, label) ({ \ + bool success; \ + __typeof__(_ptr) _old = (__typeof__(_ptr))(_pold); \ + __typeof__(*(_ptr)) __old = *_old; \ + __typeof__(*(_ptr)) __new = (_new); \ + asm_volatile_goto("\n" \ + "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\ + _ASM_EXTABLE_UA(1b, %l[label]) \ + : CC_OUT(z) (success), \ + [ptr] "+m" (*_ptr), \ + [old] "+a" (__old) \ + : [new] ltype (__new) \ + : "memory", "cc" \ + : label); \ + if (unlikely(!success)) \ + *_old = __old; \ + likely(success); }) + #else // !CONFIG_CC_HAS_ASM_GOTO_OUTPUT #ifdef CONFIG_X86_32 @@ -407,6 +425,30 @@ do { \ : [umem] "m" (__m(addr)), \ "0" (err)) +#define __try_cmpxchg_user_asm(itype, ltype, _ptr, _pold, _new, label) ({ \ + int __err = 0; \ + bool success; \ + __typeof__(_ptr) _old = (__typeof__(_ptr))(_pold); \ + __typeof__(*(_ptr)) __old = *_old; \ + __typeof__(*(_ptr)) __new = (_new); \ + asm volatile("\n" \ + "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\ + CC_SET(z) \ + "2:\n" \ + _ASM_EXTABLE_TYPE_REG(1b, 2b, EX_TYPE_EFAULT_REG, \ + %[errout]) \ + : CC_OUT(z) (success), \ + [errout] "+r" (__err), \ + [ptr] "+m" (*_ptr), \ + [old] "+a" (__old) \ + : [new] ltype (__new) \ + : "memory", "cc"); \ + if (unlikely(__err)) \ + goto label; \ + if (unlikely(!success)) \ + *_old = __old; \ + likely(success); }) + #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT /* FIXME: this hack is definitely wrong -AK */ @@ -501,6 +543,31 @@ do { \ } while (0) #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT +extern void __try_cmpxchg_user_wrong_size(void); + +#define unsafe_try_cmpxchg_user(_ptr, _oldp, _nval, _label) ({ \ + __typeof__(*(_ptr)) __ret; \ + switch (sizeof(__ret)) { \ + case 1: __ret = __try_cmpxchg_user_asm("b", "q", \ + (_ptr), (_oldp), \ + (_nval), _label); \ + break; \ + case 2: __ret = __try_cmpxchg_user_asm("w", "r", \ + (_ptr), (_oldp), \ + (_nval), _label); \ + break; \ + case 4: __ret = __try_cmpxchg_user_asm("l", "r", \ + (_ptr), (_oldp), \ + (_nval), _label); \ + break; \ + case 8: __ret = __try_cmpxchg_user_asm("q", "r", \ + (_ptr), (_oldp), \ + (_nval), _label); \ + break; \ + default: __try_cmpxchg_user_wrong_size(); \ + } \ + __ret; }) + /* * We want the unsafe accessors to always be inlined and use * the error labels - thus the macro games. 
From patchwork Thu Jan 20 15:55:22 2022
X-Patchwork-Submitter: Peter Zijlstra
X-Patchwork-Id: 12718826
Message-ID: <20220120160822.914418096@infradead.org>
Date: Thu, 20 Jan 2022 16:55:22 +0100
From: Peter Zijlstra
To: mingo@redhat.com, tglx@linutronix.de, juri.lelli@redhat.com,
 vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org,
 bsegall@google.com, mgorman@suse.de, bristot@redhat.com
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-api@vger.kernel.org,
 x86@kernel.org, peterz@infradead.org, pjt@google.com, posk@google.com,
 avagin@google.com, jannh@google.com, tdelisle@uwaterloo.ca,
 mark.rutland@arm.com, posk@posk.io
Subject: [RFC][PATCH v2 5/5] sched: User Mode Concurrency Groups
References: <20220120155517.066795336@infradead.org>
X-Rspamd-Server: rspam01 X-Rspamd-Queue-Id: ED1BE40071 X-Stat-Signature: p49mhxcufhwyk5nk8t1ybdiwd6jqdcff Authentication-Results: imf27.hostedemail.com; dkim=pass header.d=infradead.org header.s=casper.20170209 header.b=Gnby2OYp; spf=none (imf27.hostedemail.com: domain of peterz@infradead.org has no SPF policy when checking 90.155.50.34) smtp.mailfrom=peterz@infradead.org; dmarc=none X-HE-Tag: 1642694950-464061 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: User Managed Concurrency Groups is an M:N threading toolkit that allows constructing user space schedulers designed to efficiently manage heterogeneous in-process workloads while maintaining high CPU utilization (95%+). XXX moar changelog explaining how this is moar awesome than traditional user-space threading. The big thing that's still missing is the SMP wake-to-remote-idle. Originally-by: Peter Oskolkov Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20211214205358.701701555@infradead.org --- arch/x86/Kconfig | 1 arch/x86/entry/syscalls/syscall_64.tbl | 3 arch/x86/include/asm/thread_info.h | 2 fs/exec.c | 1 include/linux/entry-common.h | 16 include/linux/sched.h | 82 ++ include/linux/syscalls.h | 4 include/linux/thread_info.h | 2 include/uapi/asm-generic/unistd.h | 9 include/uapi/linux/umcg.h | 141 ++++ init/Kconfig | 15 kernel/entry/common.c | 18 kernel/exit.c | 5 kernel/sched/Makefile | 1 kernel/sched/core.c | 9 kernel/sched/umcg.c | 954 +++++++++++++++++++++++++++++++++ kernel/sys_ni.c | 5 17 files changed, 1259 insertions(+), 9 deletions(-) --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -248,6 +248,7 @@ config X86 select HAVE_RSEQ select HAVE_SYSCALL_TRACEPOINTS select HAVE_UNSTABLE_SCHED_CLOCK + select HAVE_UMCG if X86_64 select HAVE_USER_RETURN_NOTIFIER select HAVE_GENERIC_VDSO select HOTPLUG_SMT if SMP --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -371,6 +371,9 @@ 447 common memfd_secret sys_memfd_secret 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv +450 common umcg_ctl sys_umcg_ctl +451 common umcg_wait sys_umcg_wait +452 common umcg_kick sys_umcg_kick # # Due to a historical design error, certain syscalls are numbered differently --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -83,6 +83,7 @@ struct thread_info { #define TIF_NEED_RESCHED 3 /* rescheduling necessary */ #define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/ #define TIF_SSBD 5 /* Speculative store bypass disable */ +#define TIF_UMCG 6 /* UMCG return to user hook */ #define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */ #define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */ #define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */ @@ -107,6 +108,7 @@ struct thread_info { #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) #define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP) #define _TIF_SSBD (1 << TIF_SSBD) +#define _TIF_UMCG (1 << TIF_UMCG) #define _TIF_SPEC_IB (1 << TIF_SPEC_IB) #define _TIF_SPEC_L1D_FLUSH (1 << TIF_SPEC_L1D_FLUSH) #define _TIF_USER_RETURN_NOTIFY (1 << TIF_USER_RETURN_NOTIFY) --- a/fs/exec.c +++ b/fs/exec.c @@ -1838,6 +1838,7 @@ static int bprm_execve(struct linux_binp current->fs->in_exec = 0; current->in_execve = 0; rseq_execve(current); + umcg_execve(current); acct_update_integrals(current); task_numa_free(current, false); return 
retval; --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -23,6 +23,10 @@ # define _TIF_UPROBE (0) #endif +#ifndef _TIF_UMCG +# define _TIF_UMCG (0) +#endif + /* * SYSCALL_WORK flags handled in syscall_enter_from_user_mode() */ @@ -43,11 +47,13 @@ SYSCALL_WORK_SYSCALL_EMU | \ SYSCALL_WORK_SYSCALL_AUDIT | \ SYSCALL_WORK_SYSCALL_USER_DISPATCH | \ + SYSCALL_WORK_SYSCALL_UMCG | \ ARCH_SYSCALL_WORK_ENTER) #define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \ SYSCALL_WORK_SYSCALL_TRACE | \ SYSCALL_WORK_SYSCALL_AUDIT | \ SYSCALL_WORK_SYSCALL_USER_DISPATCH | \ + SYSCALL_WORK_SYSCALL_UMCG | \ SYSCALL_WORK_SYSCALL_EXIT_TRAP | \ ARCH_SYSCALL_WORK_EXIT) @@ -221,8 +227,11 @@ static inline void local_irq_disable_exi */ static inline void irqentry_irq_enable(struct pt_regs *regs) { - if (!regs_irqs_disabled(regs)) + if (!regs_irqs_disabled(regs)) { local_irq_enable(); + if (user_mode(regs) && (current->flags & PF_UMCG_WORKER)) + umcg_sys_enter(regs, -1); + } } /** @@ -232,8 +241,11 @@ static inline void irqentry_irq_enable(s */ static inline void irqentry_irq_disable(struct pt_regs *regs) { - if (!regs_irqs_disabled(regs)) + if (!regs_irqs_disabled(regs)) { + if (user_mode(regs) && (current->flags & PF_UMCG_WORKER)) + umcg_sys_exit(regs); local_irq_disable(); + } } /** --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -67,6 +67,7 @@ struct sighand_struct; struct signal_struct; struct task_delay_info; struct task_group; +struct umcg_task; /* * Task state bitmask. NOTE! These bits are also @@ -1294,6 +1295,19 @@ struct task_struct { unsigned long rseq_event_mask; #endif +#ifdef CONFIG_UMCG + /* setup by sys_umcg_ctrl() */ + clockid_t umcg_clock; + struct umcg_task __user *umcg_task; + + /* setup by umcg_pin_enter() */ + struct page *umcg_page; + + struct task_struct *umcg_server; + struct umcg_task __user *umcg_server_task; + struct page *umcg_server_page; +#endif + struct tlbflush_unmap_batch tlb_ubc; union { @@ -1687,6 +1701,13 @@ extern struct pid *cad_pid; #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ + +#ifdef CONFIG_UMCG +#define PF_UMCG_WORKER 0x01000000 /* UMCG worker */ +#else +#define PF_UMCG_WORKER 0x00000000 +#endif + #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */ #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */ #define PF_MEMALLOC_PIN 0x10000000 /* Allocation context constrained to zones which allow long term pinning. */ @@ -2294,6 +2315,67 @@ static inline void rseq_execve(struct ta { } +#endif + +#ifdef CONFIG_UMCG + +extern void umcg_sys_enter(struct pt_regs *regs, long syscall); +extern void umcg_sys_exit(struct pt_regs *regs); +extern void umcg_notify_resume(struct pt_regs *regs); +extern void umcg_worker_exit(void); +extern void umcg_clear_child(struct task_struct *tsk); + +/* Called by bprm_execve() in fs/exec.c. */ +static inline void umcg_execve(struct task_struct *tsk) +{ + if (tsk->umcg_task) + umcg_clear_child(tsk); +} + +/* Called by do_exit() in kernel/exit.c. */ +static inline void umcg_handle_exit(void) +{ + if (current->flags & PF_UMCG_WORKER) + umcg_worker_exit(); +} + +/* + * umcg_wq_worker_[sleeping|running] are called in core.c by + * sched_submit_work() and sched_update_worker(). 
+ */ +extern void umcg_wq_worker_sleeping(struct task_struct *tsk); +extern void umcg_wq_worker_running(struct task_struct *tsk); + +#else /* CONFIG_UMCG */ + +static inline void umcg_sys_enter(struct pt_regs *regs, long syscall) +{ +} + +static inline void umcg_sys_exit(struct pt_regs *regs) +{ +} + +static inline void umcg_notify_resume(struct pt_regs *regs) +{ +} + +static inline void umcg_clear_child(struct task_struct *tsk) +{ +} +static inline void umcg_execve(struct task_struct *tsk) +{ +} +static inline void umcg_handle_exit(void) +{ +} +static inline void umcg_wq_worker_sleeping(struct task_struct *tsk) +{ +} +static inline void umcg_wq_worker_running(struct task_struct *tsk) +{ +} + #endif #ifdef CONFIG_DEBUG_RSEQ --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -72,6 +72,7 @@ struct open_how; struct mount_attr; struct landlock_ruleset_attr; enum landlock_rule_type; +struct umcg_task; #include #include @@ -1057,6 +1058,9 @@ asmlinkage long sys_landlock_add_rule(in const void __user *rule_attr, __u32 flags); asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags); asmlinkage long sys_memfd_secret(unsigned int flags); +asmlinkage long sys_umcg_ctl(u32 flags, struct umcg_task __user *self, clockid_t which_clock); +asmlinkage long sys_umcg_wait(u32 flags, u64 abs_timeout); +asmlinkage long sys_umcg_kick(u32 flags, pid_t tid); /* * Architecture-specific system calls --- a/include/linux/thread_info.h +++ b/include/linux/thread_info.h @@ -46,6 +46,7 @@ enum syscall_work_bit { SYSCALL_WORK_BIT_SYSCALL_AUDIT, SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH, SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP, + SYSCALL_WORK_BIT_SYSCALL_UMCG, }; #define SYSCALL_WORK_SECCOMP BIT(SYSCALL_WORK_BIT_SECCOMP) @@ -55,6 +56,7 @@ enum syscall_work_bit { #define SYSCALL_WORK_SYSCALL_AUDIT BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT) #define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH) #define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP) +#define SYSCALL_WORK_SYSCALL_UMCG BIT(SYSCALL_WORK_BIT_SYSCALL_UMCG) #endif #include --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -883,8 +883,15 @@ __SYSCALL(__NR_process_mrelease, sys_pro #define __NR_futex_waitv 449 __SYSCALL(__NR_futex_waitv, sys_futex_waitv) +#define __NR_umcg_ctl 450 +__SYSCALL(__NR_umcg_ctl, sys_umcg_ctl) +#define __NR_umcg_wait 451 +__SYSCALL(__NR_umcg_wait, sys_umcg_wait) +#define __NR_umcg_kick 452 +__SYSCALL(__NR_umcg_kick, sys_umcg_kick) + #undef __NR_syscalls -#define __NR_syscalls 450 +#define __NR_syscalls 453 /* * 32 bit systems traditionally used different --- /dev/null +++ b/include/uapi/linux/umcg.h @@ -0,0 +1,141 @@ +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */ +#ifndef _UAPI_LINUX_UMCG_H +#define _UAPI_LINUX_UMCG_H + +#include + +/* + * UMCG: User Managed Concurrency Groups. + * + * Syscalls (see kernel/sched/umcg.c): + * sys_umcg_ctl() - register/unregister UMCG tasks; + * sys_umcg_wait() - wait/wake/context-switch. + * sys_umcg_kick() - prod a UMCG task + * + * struct umcg_task (below): controls the state of UMCG tasks. + */ + +/* + * UMCG task states, the first 8 bits of struct umcg_task.state. + * + * ,--------(TF_PREEMPT + notify_resume)-------. ,----------. 
+ * | v | | + * RUNNING -(schedule)-> BLOCKED -(sys_exit)-> RUNNABLE (signal + notify_resume) + * ^ ^ | | ^ | + * | `-----(schedule)-----' | | | + * `--------------(sys_umcg_wait)----------------' `----------' + * + */ +#define UMCG_TASK_NONE 0x0000U +#define UMCG_TASK_RUNNING 0x0001U +#define UMCG_TASK_RUNNABLE 0x0002U +#define UMCG_TASK_BLOCKED 0x0003U + +#define UMCG_TASK_MASK 0x00ffU + +/* + * UMCG_TF_PREEMPT: userspace indicates the worker should be preempted. + * + * Must only be set on UMCG_TASK_RUNNING; once set, any subsequent + * return-to-user (eg sys_umcg_kick()) will perform the equivalent of + * sys_umcg_wait() on it. That is, it will wake next_tid/server_tid, transfer + * to RUNNABLE and enqueue on the server's runnable list. + */ +#define UMCG_TF_PREEMPT 0x0100U +/* + * UMCG_TF_COND_WAIT: indicate the task *will* call sys_umcg_wait() + * + * Enables server loops like (vs umcg_sys_exit()): + * + * for(;;) { + * self->state = UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT; + * + * // smp_mb() implied by xchg() + * runnable_ptr = (void *)xchg(self->runnable_workers_ptr, NULL); + * while (runnable_ptr) { + * next = (void *)runnable_ptr->runnable_workers_ptr; + * umcg_server_add_runnable(self, runnable_ptr); + * runnable_ptr = next; + * } + * + * self->next_tid = umcg_server_pick_next(self); + * sys_umcg_wait(0, 0); + * } + * + * without a signal or interrupt in between setting umcg_task::state and + * sys_umcg_wait() resulting in an infinite wait in umcg_notify_resume(). + */ +#define UMCG_TF_COND_WAIT 0x0200U + +#define UMCG_TF_MASK 0xff00U + +#define UMCG_TASK_ALIGN 64 + +/** + * struct umcg_task - controls the state of UMCG tasks. + * + * The struct is aligned at 64 bytes to ensure that it fits into + * a single cache line. + */ +struct umcg_task { + /** + * @state: the current state of the UMCG task described by + * this struct. + * + * Readable/writable by both the kernel and the userspace. + * + * UMCG task state: + * bits 0 - 7: task state; + * bits 8 - 15: state flags; + * bits 16 - 31: for userspace use; + */ + __u32 state; /* r/w */ + +#define UMCG_TID_RUNNING 0x80000000U +#define UMCG_TID_MASK 0x3fffffffU + /** + * @next_tid: the TID of the UMCG task that should be context-switched + * into in sys_umcg_wait(). Can be zero. + * + * @server_tid: the TID of the UMCG server that hosts this task, + * when RUNNABLE this task will get added to it's + * runnable_workers_ptr list. + * + * Read-only for the kernel, read/write for the userspace. + */ + __u32 next_tid; /* r */ + __u32 server_tid; /* r */ + + __u32 __hole[1]; + + /* + * Timestamps for when last we became BLOCKED, RUNNABLE. + */ + __u64 blocked_ts; /* w */ + __u64 runnable_ts; /* w */ + + /** + * @runnable_workers_ptr: a single-linked list of runnable workers. + * + * Readable/writable by both the kernel and the userspace: the + * kernel adds items to the list, userspace removes them. + */ + __u64 runnable_workers_ptr; /* r/w */ + + __u64 __zero[3]; + +} __attribute__((packed, aligned(UMCG_TASK_ALIGN))); + +/** + * enum umcg_ctl_flag - flags to pass to sys_umcg_ctl + * @UMCG_CTL_REGISTER: register the current task as a UMCG task + * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task + * @UMCG_CTL_WORKER: register the current task as a UMCG worker + */ +enum umcg_ctl_flag { + UMCG_CTL_REGISTER = 0x00001, + UMCG_CTL_UNREGISTER = 0x00002, + UMCG_CTL_WORKER = 0x10000, +}; + +#endif /* _UAPI_LINUX_UMCG_H */ --- a/init/Kconfig +++ b/init/Kconfig @@ -1686,6 +1686,21 @@ config MEMBARRIER If unsure, say Y. 
+config HAVE_UMCG + bool + +config UMCG + bool "Enable User Managed Concurrency Groups API" + depends on 64BIT + depends on GENERIC_ENTRY + depends on HAVE_UMCG + default n + help + Enable User Managed Concurrency Groups API, which form the basis + for an in-process M:N userspace scheduling framework. + At the moment this is an experimental/RFC feature that is not + guaranteed to be backward-compatible. + config KALLSYMS bool "Load all symbols for debugging/ksymoops" if EXPERT default y --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -6,6 +6,7 @@ #include #include #include +#include #include "common.h" @@ -76,6 +77,9 @@ static long syscall_trace_enter(struct p if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT)) trace_sys_enter(regs, syscall); + if (work & SYSCALL_WORK_SYSCALL_UMCG) + umcg_sys_enter(regs, syscall); + syscall_enter_audit(regs, syscall); return ret ? : syscall; @@ -155,8 +159,7 @@ static unsigned long exit_to_user_mode_l * Before returning to user space ensure that all pending work * items have been completed. */ - while (ti_work & EXIT_TO_USER_MODE_WORK) { - + do { local_irq_enable_exit_to_user(ti_work); if (ti_work & _TIF_NEED_RESCHED) @@ -168,6 +171,10 @@ static unsigned long exit_to_user_mode_l if (ti_work & _TIF_PATCH_PENDING) klp_update_patch_state(current); + /* must be before handle_signal_work(); terminates on sigpending */ + if (ti_work & _TIF_UMCG) + umcg_notify_resume(regs); + if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) handle_signal_work(regs, ti_work); @@ -188,7 +195,7 @@ static unsigned long exit_to_user_mode_l tick_nohz_user_enter_prepare(); ti_work = read_thread_flags(); - } + } while (ti_work & EXIT_TO_USER_MODE_WORK); /* Return the latest work state for arch_exit_to_user_mode() */ return ti_work; @@ -203,7 +210,7 @@ static void exit_to_user_mode_prepare(st /* Flush pending rcuog wakeup before the last need_resched() check */ tick_nohz_user_enter_prepare(); - if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK)) + if (unlikely(ti_work & (EXIT_TO_USER_MODE_WORK | _TIF_UMCG))) ti_work = exit_to_user_mode_loop(regs, ti_work); arch_exit_to_user_mode_prepare(regs, ti_work); @@ -253,6 +260,9 @@ static void syscall_exit_work(struct pt_ step = report_single_step(work); if (step || work & SYSCALL_WORK_SYSCALL_TRACE) arch_syscall_exit_tracehook(regs, step); + + if (work & SYSCALL_WORK_SYSCALL_UMCG) + umcg_sys_exit(regs); } /* --- a/kernel/exit.c +++ b/kernel/exit.c @@ -749,6 +749,10 @@ void __noreturn do_exit(long code) if (unlikely(!tsk->pid)) panic("Attempted to kill the idle task!"); + /* Turn off UMCG sched hooks. 
*/ + if (unlikely(tsk->flags & PF_UMCG_WORKER)) + tsk->flags &= ~PF_UMCG_WORKER; + /* * If do_exit is called because this processes oopsed, it's possible * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before @@ -786,6 +790,7 @@ void __noreturn do_exit(long code) io_uring_files_cancel(); exit_signals(tsk); /* sets PF_EXITING */ + umcg_handle_exit(); /* sync mm's RSS info before statistics gathering */ if (tsk->mm) --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -41,3 +41,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o obj-$(CONFIG_CPU_ISOLATION) += isolation.o obj-$(CONFIG_PSI) += psi.o obj-$(CONFIG_SCHED_CORE) += core_sched.o +obj-$(CONFIG_UMCG) += umcg.o --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4272,6 +4272,7 @@ static void __sched_fork(unsigned long c p->wake_entry.u_flags = CSD_TYPE_TTWU; p->migration_pending = NULL; #endif + umcg_clear_child(p); } DEFINE_STATIC_KEY_FALSE(sched_numa_balancing); @@ -6330,9 +6331,11 @@ static inline void sched_submit_work(str * If a worker goes to sleep, notify and ask workqueue whether it * wants to wake up a task to maintain concurrency. */ - if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) { + if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) { if (task_flags & PF_WQ_WORKER) wq_worker_sleeping(tsk); + else if (task_flags & PF_UMCG_WORKER) + umcg_wq_worker_sleeping(tsk); else io_wq_worker_sleeping(tsk); } @@ -6350,9 +6353,11 @@ static inline void sched_submit_work(str static void sched_update_worker(struct task_struct *tsk) { - if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) { + if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) { if (tsk->flags & PF_WQ_WORKER) wq_worker_running(tsk); + else if (tsk->flags & PF_UMCG_WORKER) + umcg_wq_worker_running(tsk); else io_wq_worker_running(tsk); } --- /dev/null +++ b/kernel/sched/umcg.c @@ -0,0 +1,954 @@ +// SPDX-License-Identifier: GPL-2.0-only + +/* + * User Managed Concurrency Groups (UMCG). + * + */ + +#include +#include +#include +#include +#include + +#include +#include + +#include "sched.h" + +static struct task_struct *umcg_get_task(u32 tid) +{ + struct task_struct *tsk = NULL; + + if (tid) { + rcu_read_lock(); + tsk = find_task_by_vpid(tid & UMCG_TID_MASK); + if (tsk && current->mm == tsk->mm && tsk->umcg_task) + get_task_struct(tsk); + else + tsk = NULL; + rcu_read_unlock(); + } + + return tsk; +} + +/* + * Pinning a page inhibits rmap based unmap for Anon pages. Doing a load + * through the user mapping ensures the user mapping exists. + */ +#define umcg_pin_and_load(_self, _pagep, _member) \ +({ \ + __label__ __out; \ + int __ret = -EFAULT; \ + \ + if (pin_user_pages_fast((unsigned long)(_self), 1, 0, &(_pagep)) != 1) \ + goto __out; \ + \ + if (!PageAnon(_pagep) || \ + get_user(_member, &(_self)->_member)) { \ + unpin_user_page(_pagep); \ + goto __out; \ + } \ + __ret = 0; \ +__out: __ret; \ +}) + +/** + * umcg_pin_pages: pin pages containing struct umcg_task of + * this task and its server (possibly this task again). 
+ */ +static int umcg_pin_pages(void) +{ + struct task_struct *server = NULL, *tsk = current; + struct umcg_task __user *self = READ_ONCE(tsk->umcg_task); + int server_tid; + int ret; + + /* must not have stale state */ + if (WARN_ON_ONCE(tsk->umcg_page || + tsk->umcg_server_page || + tsk->umcg_server_task || + tsk->umcg_server)) + return -EBUSY; + + ret = umcg_pin_and_load(self, tsk->umcg_page, server_tid); + if (ret) + goto clear_self; + + ret = -ESRCH; + server = umcg_get_task(server_tid); + if (!server) + goto unpin_self; + + /* must cache due to possible concurrent change */ + tsk->umcg_server_task = READ_ONCE(server->umcg_task); + ret = umcg_pin_and_load(tsk->umcg_server_task, tsk->umcg_server_page, server_tid); + if (ret) + goto clear_server; + + tsk->umcg_server = server; + + return 0; + +clear_server: + tsk->umcg_server_task = NULL; + tsk->umcg_server_page = NULL; + +unpin_self: + unpin_user_page(tsk->umcg_page); +clear_self: + tsk->umcg_page = NULL; + + return ret; +} + +static void umcg_unpin_pages(void) +{ + struct task_struct *tsk = current; + + if (tsk->umcg_server) { + unpin_user_page(tsk->umcg_page); + tsk->umcg_page = NULL; + + unpin_user_page(tsk->umcg_server_page); + tsk->umcg_server_page = NULL; + tsk->umcg_server_task = NULL; + + put_task_struct(tsk->umcg_server); + tsk->umcg_server = NULL; + } +} + +static void umcg_clear_task(struct task_struct *tsk) +{ + /* + * This is either called for the current task, or for a newly forked + * task that is not yet running, so we don't need strict atomicity + * below. + */ + if (tsk->umcg_task) { + WRITE_ONCE(tsk->umcg_task, NULL); + tsk->umcg_page = NULL; + + tsk->umcg_server = NULL; + tsk->umcg_server_page = NULL; + tsk->umcg_server_task = NULL; + + tsk->flags &= ~PF_UMCG_WORKER; + clear_task_syscall_work(tsk, SYSCALL_UMCG); + clear_tsk_thread_flag(tsk, TIF_UMCG); + } +} + +/* Called for a forked or execve-ed child. */ +void umcg_clear_child(struct task_struct *tsk) +{ + umcg_clear_task(tsk); +} + +/* Called both by normally (unregister) and abnormally exiting workers. */ +void umcg_worker_exit(void) +{ + umcg_unpin_pages(); + umcg_clear_task(current); +} + +/* + * Do a state transition: @from -> @to. + * + * Will clear UMCG_TF_PREEMPT, UMCG_TF_COND_WAIT. + * + * When @to == {BLOCKED,RUNNABLE}, update timestamps. 
+ * + * Returns: + * 0: success + * -EAGAIN: when self->state != @from + * -EFAULT + */ +static int umcg_update_state(struct task_struct *tsk, + struct umcg_task __user *self, + u32 from, u32 to) +{ + u32 old, new; + u64 now; + + if (to >= UMCG_TASK_RUNNABLE) { + switch (tsk->umcg_clock) { + case CLOCK_REALTIME: now = ktime_get_real_ns(); break; + case CLOCK_MONOTONIC: now = ktime_get_ns(); break; + case CLOCK_BOOTTIME: now = ktime_get_boottime_ns(); break; + case CLOCK_TAI: now = ktime_get_clocktai_ns(); break; + } + } + + if (!user_access_begin(self, sizeof(*self))) + return -EFAULT; + + unsafe_get_user(old, &self->state, Efault); + do { + if ((old & UMCG_TASK_MASK) != from) + goto fail; + + new = old & ~(UMCG_TASK_MASK | + UMCG_TF_PREEMPT | UMCG_TF_COND_WAIT); + new |= to & UMCG_TASK_MASK; + + } while (!unsafe_try_cmpxchg_user(&self->state, &old, new, Efault)); + + if (to == UMCG_TASK_BLOCKED) + unsafe_put_user(now, &self->blocked_ts, Efault); + if (to == UMCG_TASK_RUNNABLE) + unsafe_put_user(now, &self->runnable_ts, Efault); + + user_access_end(); + return 0; + +fail: + user_access_end(); + return -EAGAIN; + +Efault: + user_access_end(); + return -EFAULT; +} + +#define __UMCG_DIE(stmt, reason) do { \ + stmt; \ + pr_warn_ratelimited("%s: killing task %s/%d because: " reason "\n",\ + __func__, current->comm, current->pid); \ + force_sig(SIGKILL); \ + return; \ +} while (0) + +#define UMCG_DIE(reason) __UMCG_DIE(,reason) +#define UMCG_DIE_PF(reason) __UMCG_DIE(pagefault_enable(), reason) +#define UMCG_DIE_UNPIN(reason) __UMCG_DIE(umcg_unpin_pages(), reason) + +/* Called from syscall enter path and exceptions that can schedule */ +void umcg_sys_enter(struct pt_regs *regs, long syscall) +{ + /* avoid recursion vs our own syscalls */ + if (syscall == __NR_umcg_wait || + syscall == __NR_umcg_ctl) + return; + + /* avoid recursion vs schedule() */ + current->flags &= ~PF_UMCG_WORKER; + + /* + * Pin all the state on sys_enter() such that we can rely on it + * from dodgy contexts. It is either unpinned from pre-schedule() + * or sys_exit(), whichever comes first, thereby ensuring the pin + * is temporary. + */ + if (umcg_pin_pages()) + UMCG_DIE("pin"); + + current->flags |= PF_UMCG_WORKER; +} + +static int umcg_wake_task(struct task_struct *tsk, struct umcg_task __user *self) +{ + int ret = umcg_update_state(tsk, self, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING); + if (ret) + return ret; + + try_to_wake_up(tsk, TASK_NORMAL, WF_CURRENT_CPU); + return 0; +} + +static int umcg_wake_server(struct task_struct *tsk) +{ + int ret = umcg_wake_task(tsk->umcg_server, tsk->umcg_server_task); + if (ret == -EAGAIN) { + /* + * Server could have timed-out or already be running + * due to a runnable enqueue. See umcg_sys_exit(). + */ + ret = 0; + } + return ret; +} + +/* pre-schedule() */ +void umcg_wq_worker_sleeping(struct task_struct *tsk) +{ + struct umcg_task __user *self = READ_ONCE(tsk->umcg_task); + int ret; + + if (!tsk->umcg_server) { + /* + * Already blocked before, the pages are unpinned. + */ + return; + } + + /* Must not fault, mmap_sem might be held. */ + pagefault_disable(); + + ret = umcg_update_state(tsk, self, UMCG_TASK_RUNNING, UMCG_TASK_BLOCKED); + if (ret == -EAGAIN) { + /* + * Consider: + * + * self->state = UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT; + * ... + * sys_umcg_wait(); + * + * and the '...' code doing a blocking syscall/fault. This + * ensures that returns with UMCG_TASK_RUNNING, which will make + * sys_umcg_wait() return with -EAGAIN. 
+		 */
+		ret = umcg_update_state(tsk, self, UMCG_TASK_RUNNABLE, UMCG_TASK_BLOCKED);
+	}
+	if (ret)
+		UMCG_DIE_PF("state");
+
+	if (umcg_wake_server(tsk))
+		UMCG_DIE_PF("wake");
+
+	pagefault_enable();
+
+	/*
+	 * We're going to sleep, make sure to unpin the pages, this ensures
+	 * the pins are temporary. Also see umcg_sys_exit().
+	 */
+	umcg_unpin_pages();
+}
+
+/* post-schedule() */
+void umcg_wq_worker_running(struct task_struct *tsk)
+{
+	/* nothing here, see umcg_sys_exit() */
+}
+
+/*
+ * Enqueue @tsk on its server's runnable list
+ *
+ * Must be called in umcg_pin_pages() context, relies on tsk->umcg_server.
+ *
+ * cmpxchg based singly linked list add such that list integrity is never
+ * violated. Userspace *MUST* remove it from the list before changing ->state.
+ * As such, we must change state to RUNNABLE before enqueue.
+ *
+ * Returns:
+ *   0: success
+ *   -EFAULT
+ */
+static int umcg_enqueue_runnable(struct task_struct *tsk)
+{
+	struct umcg_task __user *server = tsk->umcg_server_task;
+	struct umcg_task __user *self = tsk->umcg_task;
+	u64 first_ptr, *head = &server->runnable_workers_ptr;
+	u64 self_ptr = (unsigned long)self;
+
+	/*
+	 * umcg_pin_pages() did access_ok() on both pointers, use self here
+	 * only because __user_access_begin() isn't available in generic code.
+	 */
+	if (!user_access_begin(self, sizeof(*self)))
+		return -EFAULT;
+
+	unsafe_get_user(first_ptr, head, Efault);
+	do {
+		unsafe_put_user(first_ptr, &self->runnable_workers_ptr, Efault);
+	} while (!unsafe_try_cmpxchg_user(head, &first_ptr, self_ptr, Efault));
+
+	user_access_end();
+	return 0;
+
+Efault:
+	user_access_end();
+	return -EFAULT;
+}
+
+static int umcg_enqueue_and_wake(struct task_struct *tsk)
+{
+	int ret;
+
+	ret = umcg_enqueue_runnable(tsk);
+	if (!ret)
+		ret = umcg_wake_server(tsk);
+
+	return ret;
+}
+
+/*
+ * umcg_wait: Wait for ->state to become RUNNING
+ *
+ * Returns:
+ *   0		- success
+ *   -EINTR	- pending signal
+ *   -EINVAL	- ::state is not {RUNNABLE,RUNNING}
+ *   -ETIMEDOUT
+ *   -EFAULT
+ */
+static int umcg_wait(u64 timo)
+{
+	struct task_struct *tsk = current;
+	struct umcg_task __user *self = tsk->umcg_task;
+	struct page *page = NULL;
+	u32 state;
+	int ret;
+
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+
+		ret = -EINTR;
+		if (signal_pending(current))
+			break;
+
+		/*
+		 * Faults can block and scribble our wait state.
+		 */
+		pagefault_disable();
+		if (get_user(state, &self->state)) {
+			pagefault_enable();
+			__set_current_state(TASK_RUNNING);
+
+			ret = -EFAULT;
+			if (page) {
+				unpin_user_page(page);
+				page = NULL;
+				break;
+			}
+
+			ret = umcg_pin_and_load(self, page, state);
+			if (ret) {
+				page = NULL;
+				break;
+			}
+
+			continue;
+		}
+
+		if (page) {
+			unpin_user_page(page);
+			page = NULL;
+		}
+		pagefault_enable();
+
+		state &= UMCG_TASK_MASK;
+		if (state != UMCG_TASK_RUNNABLE) {
+			ret = 0;
+			if (state == UMCG_TASK_RUNNING)
+				break;
+
+			ret = -EINVAL;
+			break;
+		}
+
+		if (!schedule_hrtimeout_range_clock(timo ? &timo : NULL,
+						    tsk->timer_slack_ns,
+						    HRTIMER_MODE_ABS,
+						    tsk->umcg_clock)) {
+			ret = -ETIMEDOUT;
+			break;
+		}
+	}
+	__set_current_state(TASK_RUNNING);
+
+	return ret;
+}
+
+/*
+ * Blocked case for umcg_sys_exit(), shared with sys_umcg_ctl().
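+ *
+ * Transitions BLOCKED -> RUNNABLE, enqueues the worker on its server's
+ * runnable list, wakes the server and then waits until the worker is made
+ * RUNNING again.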
+ */
+static void umcg_unblock_and_wait(void)
+{
+	struct task_struct *tsk = current;
+	struct umcg_task __user *self = READ_ONCE(tsk->umcg_task);
+
+	/* avoid recursion vs schedule() */
+	tsk->flags &= ~PF_UMCG_WORKER;
+
+	if (umcg_pin_pages())
+		UMCG_DIE("pin");
+
+	if (umcg_update_state(tsk, self, UMCG_TASK_BLOCKED, UMCG_TASK_RUNNABLE))
+		UMCG_DIE_UNPIN("state");
+
+	if (umcg_enqueue_and_wake(tsk))
+		UMCG_DIE_UNPIN("enqueue-wake");
+
+	umcg_unpin_pages();
+
+	switch (umcg_wait(0)) {
+	case 0:
+	case -EINTR:
+		/* notify_resume will continue the wait after the signal */
+		break;
+
+	default:
+		UMCG_DIE("wait");
+	}
+
+	tsk->flags |= PF_UMCG_WORKER;
+}
+
+/* Called from syscall exit path and exceptions that can schedule */
+void umcg_sys_exit(struct pt_regs *regs)
+{
+	struct task_struct *tsk = current;
+	long syscall = syscall_get_nr(tsk, regs);
+
+	if (syscall == __NR_umcg_wait ||
+	    syscall == __NR_umcg_ctl)
+		return;
+
+	if (tsk->umcg_server) {
+		/*
+		 * Didn't block, we're done.
+		 */
+		umcg_unpin_pages();
+		return;
+	}
+
+	umcg_unblock_and_wait();
+}
+
+/* return-to-user path */
+void umcg_notify_resume(struct pt_regs *regs)
+{
+	struct task_struct *tsk = current;
+	struct umcg_task __user *self = tsk->umcg_task;
+	bool worker = tsk->flags & PF_UMCG_WORKER;
+	u32 state;
+
+	/* avoid recursion vs schedule() */
+	if (worker)
+		current->flags &= ~PF_UMCG_WORKER;
+
+	if (get_user(state, &self->state))
+		UMCG_DIE("get-state");
+
+	state &= UMCG_TASK_MASK | UMCG_TF_MASK;
+	if (state == UMCG_TASK_RUNNING)
+		goto done;
+
+	/*
+	 * See comment at UMCG_TF_COND_WAIT; TL;DR: user *will* call
+	 * sys_umcg_wait() and signals/interrupts shouldn't block
+	 * return-to-user.
+	 */
+	if (state == (UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT))
+		goto done;
+
+	if (state & UMCG_TF_PREEMPT) {
+		if (umcg_pin_pages())
+			UMCG_DIE("pin");
+
+		if (umcg_update_state(tsk, self,
+				      UMCG_TASK_RUNNING,
+				      UMCG_TASK_RUNNABLE))
+			UMCG_DIE_UNPIN("state");
+
+		if (umcg_enqueue_and_wake(tsk))
+			UMCG_DIE_UNPIN("enqueue-wake");
+
+		umcg_unpin_pages();
+	}
+
+	switch (umcg_wait(0)) {
+	case 0:
+	case -EINTR:
+		/* we will resume the wait after the signal */
+		break;
+
+	default:
+		UMCG_DIE("wait");
+	}
+
+done:
+	if (worker)
+		current->flags |= PF_UMCG_WORKER;
+}
+
+/**
+ * sys_umcg_kick: makes a UMCG task cycle through umcg_notify_resume()
+ *
+ * Returns:
+ *   0		- Ok;
+ *   -ESRCH	- not a related UMCG task
+ *   -EINVAL	- another error happened (unknown flags, etc.)
+ */
+SYSCALL_DEFINE2(umcg_kick, u32, flags, pid_t, tid)
+{
+	struct task_struct *task = umcg_get_task(tid);
+	if (!task)
+		return -ESRCH;
+
+	if (flags) {
+		put_task_struct(task);
+		return -EINVAL;
+	}
+
+#ifdef CONFIG_SMP
+	smp_send_reschedule(task_cpu(task));
+#endif
+
+	/* umcg_get_task() took a reference, drop it. */
+	put_task_struct(task);
+	return 0;
+}
+
+/*
+ * Handles ::next_tid as per sys_umcg_wait().
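+ *
+ * The table below lists the ::next_tid values as read from self->next_tid;
+ * RUNNING denotes the UMCG_TID_RUNNING bit: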
+ *
+ *   ::next_tid		- return
+ *   -----------------------------
+ *     0		- 0 (success)
+ *
+ *     tid		- -ESRCH (no such task, or not of this UMCG)
+ *			- -EAGAIN (next::state != RUNNABLE)
+ *			- 0 (success, ::next_tid |= RUNNING)
+ *
+ *     tid|RUNNING	- -EAGAIN (next::state != RUNNING)
+ *			- 0 (success)
+ *
+ * Returns:
+ *   0: success
+ *   -EFAULT
+ *   -ESRCH
+ *   -EAGAIN
+ */
+static int umcg_wake_next(struct task_struct *tsk, struct umcg_task __user *self)
+{
+	struct umcg_task __user *next_task;
+	struct task_struct *next;
+	u32 next_tid, state;
+	int ret;
+
+	if (get_user(next_tid, &self->next_tid))
+		return -EFAULT;
+
+	if (!next_tid)
+		return 0;
+
+	next = umcg_get_task(next_tid);
+	if (!next)
+		return -ESRCH;
+
+	next_task = READ_ONCE(next->umcg_task);
+
+	if (next_tid & UMCG_TID_RUNNING) {
+		ret = -EFAULT;
+		if (get_user(state, &next_task->state))
+			goto put_next;
+
+		ret = 0;
+		if ((state & UMCG_TASK_MASK) != UMCG_TASK_RUNNING)
+			ret = -EAGAIN;
+
+	} else {
+		ret = umcg_wake_task(next, next_task);
+		if (ret)
+			goto put_next;
+
+		ret = -EFAULT;
+		if (put_user(next_tid | UMCG_TID_RUNNING, &self->next_tid))
+			goto put_next;
+
+		/*
+		 * If this is a worker doing sys_umcg_wait() switching to
+		 * another worker, userspace has the responsibility to update
+		 * server::next_tid.
+		 */
+
+		ret = 0;
+	}
+
+put_next:
+	put_task_struct(next);
+	return ret;
+}
+
+/**
+ * sys_umcg_wait: transfer running context
+ *
+ * Called like:
+ *
+ *	self->state = UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT;
+ *	...
+ *	sys_umcg_wait(0, time);
+ *
+ * The syscall will clear TF_COND_WAIT and wait until state becomes RUNNING.
+ * The code '...' must not contain syscalls.
+ *
+ * If self->next_tid is set and indicates a valid UMCG task in RUNNABLE state,
+ * that task will be made RUNNING and woken -- transferring the running context
+ * to that task. In this case UMCG_TID_RUNNING is set in self->next_tid to
+ * indicate that self->next_tid has been consumed.
+ *
+ * If self->next_tid has TID_RUNNING set, it is validated that the related task
+ * is in RUNNING state, otherwise -EAGAIN is returned to indicate a new task
+ * needs to be selected.
+ *
+ * If the caller is a worker:
+ *
+ *  - it will be enqueued on the associated server's runnable_workers_ptr list
+ *    and the server will be woken.
+ *
+ *  - when ::next_tid is used to effect a worker-to-worker transfer, it is up
+ *    to userspace to keep server::next_tid consistent.
+ *
+ * The corollary is that a server setting ::next_tid to 0 will idle.
+ *
+ * Returns:
+ *	0		- OK;
+ *	-ETIMEDOUT	- the timeout expired;
+ *	-ERANGE		- the timeout is out of range (worker);
+ *	-EAGAIN		- ::state wasn't RUNNABLE, concurrent wakeup;
+ *	-EFAULT		- failed accessing struct umcg_task __user of the current
+ *			  task, the server or next;
+ *	-ESRCH		- the task to wake not found or not a UMCG task;
+ *	-EINVAL		- another error happened (e.g. the current task is not a
+ *			  UMCG task, etc.)
+ */
+SYSCALL_DEFINE2(umcg_wait, u32, flags, u64, timo)
+{
+	struct task_struct *tsk = current;
+	struct umcg_task __user *self = READ_ONCE(tsk->umcg_task);
+	bool worker = tsk->flags & PF_UMCG_WORKER;
+	int ret;
+
+	if (!self || flags)
+		return -EINVAL;
+
+	if (worker) {
+		tsk->flags &= ~PF_UMCG_WORKER;
+		if (timo)
+			return -ERANGE;
+	}
+
+	/* see umcg_sys_{enter,exit}() syscall exceptions */
+	ret = umcg_pin_pages();
+	if (ret)
+		goto unblock;
+
+	/*
+	 * Clear UMCG_TF_COND_WAIT *and* check state == RUNNABLE.
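+	 * A RUNNABLE -> RUNNABLE transition does exactly that; it fails with
+	 * -EAGAIN when a concurrent wakeup has already made this task RUNNING.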
+	 */
+	ret = umcg_update_state(tsk, self, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNABLE);
+	if (ret)
+		goto unpin;
+
+	ret = umcg_wake_next(tsk, self);
+	if (ret)
+		goto unpin;
+
+	if (worker) {
+		/*
+		 * If this fails it is possible ::next_tid is already running
+		 * while this task is not going to block. This violates our
+		 * constraints.
+		 *
+		 * That said, pretty much the only way to make this fail is by
+		 * force munmap()'ing things. In which case one is most welcome
+		 * to the pieces.
+		 */
+		ret = umcg_enqueue_and_wake(tsk);
+		if (ret)
+			goto unpin;
+	}
+
+	umcg_unpin_pages();
+
+	ret = umcg_wait(timo);
+	switch (ret) {
+	case 0:		/* all done */
+	case -EINTR:	/* umcg_notify_resume() will continue the wait */
+		ret = 0;
+		break;
+
+	default:
+		goto unblock;
+	}
+out:
+	if (worker)
+		tsk->flags |= PF_UMCG_WORKER;
+	return ret;
+
+unpin:
+	umcg_unpin_pages();
+unblock:
+	umcg_update_state(tsk, self, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING);
+	goto out;
+}
+
+static int umcg_register(struct umcg_task __user *self, u32 flags, clockid_t which_clock)
+{
+	struct task_struct *server;
+	struct umcg_task ut;
+
+	switch (which_clock) {
+	case CLOCK_REALTIME:
+	case CLOCK_MONOTONIC:
+	case CLOCK_BOOTTIME:
+	case CLOCK_TAI:
+		current->umcg_clock = which_clock;
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
+	if (current->umcg_task || !self)
+		return -EINVAL;
+
+	if (copy_from_user(&ut, self, sizeof(ut)))
+		return -EFAULT;
+
+	if (ut.next_tid || ut.__hole[0] || ut.__zero[0] || ut.__zero[1] || ut.__zero[2])
+		return -EINVAL;
+
+	rcu_read_lock();
+	server = find_task_by_vpid(ut.server_tid);
+	if (server && server->mm == current->mm) {
+		if (flags == UMCG_CTL_WORKER) {
+			if (!server->umcg_task ||
+			    (server->flags & PF_UMCG_WORKER))
+				server = NULL;
+		} else {
+			if (server != current)
+				server = NULL;
+		}
+	} else {
+		server = NULL;
+	}
+	rcu_read_unlock();
+
+	if (!server)
+		return -ESRCH;
+
+	if (flags == UMCG_CTL_WORKER) {
+		if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_BLOCKED)
+			return -EINVAL;
+
+		WRITE_ONCE(current->umcg_task, self);
+		current->flags |= PF_UMCG_WORKER;	/* hook schedule() */
+		set_syscall_work(SYSCALL_UMCG);		/* hook syscall */
+		set_thread_flag(TIF_UMCG);		/* hook return-to-user */
+
+		umcg_unblock_and_wait();
+
+	} else {
+		if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_RUNNING)
+			return -EINVAL;
+
+		WRITE_ONCE(current->umcg_task, self);
+		set_thread_flag(TIF_UMCG);		/* hook return-to-user */
+
+		/* umcg_notify_resume() would block if not RUNNING */
+	}
+
+	return 0;
+}
+
+static int umcg_unregister(struct umcg_task __user *self, u32 flags)
+{
+	bool worker = current->flags & PF_UMCG_WORKER;
+	int ret;
+
+	if (!self || self != current->umcg_task)
+		return -EINVAL;
+
+	if (!worker != !(flags & UMCG_CTL_WORKER))
+		return -EINVAL;
+
+	current->flags &= ~PF_UMCG_WORKER;
+
+	ret = umcg_pin_pages();
+	if (ret) {
+		if (worker)
+			current->flags |= PF_UMCG_WORKER;
+		return ret;
+	}
+
+	ret = umcg_update_state(current, self, UMCG_TASK_RUNNING, UMCG_TASK_NONE);
+	if (ret) {
+		/* don't leak the pins taken above */
+		umcg_unpin_pages();
+		if (worker)
+			current->flags |= PF_UMCG_WORKER;
+		return ret;
+	}
+
+	if (worker)
+		umcg_wake_server(current);
+
+	umcg_unpin_pages();
+	umcg_clear_task(current);
+	return 0;
+}
+
+#define UMCG_CTL_CMD	0xff
+
+/**
+ * sys_umcg_ctl: (un)register the current task as a UMCG task.
+ * @flags:       ORed values from enum umcg_ctl_flag; see below;
+ * @self:        a pointer to struct umcg_task that describes this
+ *               task and governs the behavior of sys_umcg_wait.
+ * @which_clock: clockid to use for timestamps and timeouts
+ *
+ * @flags & UMCG_CTL_REGISTER: register a UMCG task:
+ *
+ *	UMCG workers:
+ *	 - @flags & UMCG_CTL_WORKER
+ *	 - self->state must be UMCG_TASK_BLOCKED
+ *
+ *	UMCG servers:
+ *	 - !(@flags & UMCG_CTL_WORKER)
+ *	 - self->state must be UMCG_TASK_RUNNING
+ *
+ *	All tasks:
+ *	 - self->server_tid must be a valid server
+ *	 - self->next_tid must be zero
+ *
+ *	If the conditions above are met, sys_umcg_ctl() immediately returns
+ *	if the registered task is a server. If the registered task is a
+ *	worker, it will be added to its server's runnable_workers_ptr list
+ *	and the server will be woken.
+ *
+ * @flags & UMCG_CTL_UNREGISTER: unregister a UMCG task.
+ *
+ *	UMCG workers:
+ *	 - @flags & UMCG_CTL_WORKER
+ *
+ *	UMCG servers:
+ *	 - !(@flags & UMCG_CTL_WORKER)
+ *
+ *	All tasks:
+ *	 - @self must match the pointer passed at UMCG_CTL_REGISTER
+ *	 - self->state must be UMCG_TASK_RUNNING
+ *	 - self->server_tid must be a valid server
+ *
+ *	If the conditions above are met, sys_umcg_ctl() will change state to
+ *	UMCG_TASK_NONE, and for workers, wake either next or server.
+ *
+ * Return:
+ *	0		- success
+ *	-EFAULT		- failed to read @self
+ *	-EINVAL		- some other error occurred
+ *	-ESRCH		- no such server_tid
+ */
+SYSCALL_DEFINE3(umcg_ctl, u32, flags, struct umcg_task __user *, self, clockid_t, which_clock)
+{
+	int cmd = flags & UMCG_CTL_CMD;
+
+	if ((unsigned long)self % UMCG_TASK_ALIGN)
+		return -EINVAL;
+
+	flags &= ~UMCG_CTL_CMD;
+
+	if (flags & ~(UMCG_CTL_WORKER))
+		return -EINVAL;
+
+	switch (cmd) {
+	case UMCG_CTL_REGISTER:
+		return umcg_register(self, flags, which_clock);
+
+	case UMCG_CTL_UNREGISTER:
+		return umcg_unregister(self, flags);
+
+	default:
+		break;
+	}
+
+	return -EINVAL;
+}
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -273,6 +273,11 @@ COND_SYSCALL(landlock_create_ruleset);
 COND_SYSCALL(landlock_add_rule);
 COND_SYSCALL(landlock_restrict_self);
 
+/* kernel/sched/umcg.c */
+COND_SYSCALL(umcg_ctl);
+COND_SYSCALL(umcg_wait);
+COND_SYSCALL(umcg_kick);
+
 /* arch/example/kernel/sys_example.c */
 
 /* mm/fadvise.c */
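
For illustration only (not part of the patch): a minimal sketch of how a
userspace server thread might drive this interface. It assumes the uapi
<linux/umcg.h> definitions and syscall numbers added earlier in this series;
the wrapper names and the pick_next_worker() policy hook are invented for the
example.

/* Illustrative sketch only; error handling kept to a minimum. */
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/umcg.h>			/* assumed uapi header from this series */

extern uint32_t pick_next_worker(void);	/* user-supplied policy (hypothetical) */

static struct umcg_task server_task __attribute__((aligned(UMCG_TASK_ALIGN)));

static long sys_umcg_ctl(uint32_t flags, struct umcg_task *self, clockid_t clk)
{
	return syscall(__NR_umcg_ctl, flags, self, clk);	/* hypothetical wrapper */
}

static long sys_umcg_wait(uint32_t flags, uint64_t timo)
{
	return syscall(__NR_umcg_wait, flags, timo);		/* hypothetical wrapper */
}

int server_loop(void)
{
	/* A server registers itself: state RUNNING, server_tid = own tid. */
	server_task.state = UMCG_TASK_RUNNING;
	server_task.server_tid = gettid();
	if (sys_umcg_ctl(UMCG_CTL_REGISTER, &server_task, CLOCK_MONOTONIC))
		return -1;

	for (;;) {
		/*
		 * Select the next worker (e.g. popped from the
		 * runnable_workers_ptr list filled in by the kernel),
		 * publish it in ::next_tid, then cond-wait. A zero
		 * ::next_tid simply idles the server.
		 */
		server_task.next_tid = pick_next_worker();
		server_task.state = UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT;

		if (sys_umcg_wait(0, 0) == 0)
			continue;	/* woken: a worker blocked or became runnable */

		if (errno == EAGAIN)
			continue;	/* concurrent wakeup raced us back to RUNNING */

		return -1;
	}
}

The worker side would be the mirror image: it registers with
UMCG_CTL_REGISTER | UMCG_CTL_WORKER and state UMCG_TASK_BLOCKED, after which
it is enqueued on its server's runnable list and parked until a server runs
it, as described in the sys_umcg_ctl() comment above.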