From patchwork Mon Nov 5 16:55:46 2018
X-Patchwork-Submitter: Daniel Jordan
X-Patchwork-Id: 10668745
From: Daniel Jordan
To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com,
rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz Subject: [RFC PATCH v4 01/13] ktask: add documentation Date: Mon, 5 Nov 2018 11:55:46 -0500 Message-Id: <20181105165558.11698-2-daniel.m.jordan@oracle.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com> References: <20181105165558.11698-1-daniel.m.jordan@oracle.com> MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9068 signatures=668683 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=0 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1807170000 definitions=main-1811050153 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Motivates and explains the ktask API for kernel clients. Signed-off-by: Daniel Jordan --- Documentation/core-api/index.rst | 1 + Documentation/core-api/ktask.rst | 213 +++++++++++++++++++++++++++++++ 2 files changed, 214 insertions(+) create mode 100644 Documentation/core-api/ktask.rst diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index 3adee82be311..c143a280a5b1 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -18,6 +18,7 @@ Core utilities refcount-vs-atomic cpu_hotplug idr + ktask local_ops workqueue genericirq diff --git a/Documentation/core-api/ktask.rst b/Documentation/core-api/ktask.rst new file mode 100644 index 000000000000..c3c00e1f802f --- /dev/null +++ b/Documentation/core-api/ktask.rst @@ -0,0 +1,213 @@ +.. SPDX-License-Identifier: GPL-2.0+ + +============================================ +ktask: parallelize CPU-intensive kernel work +============================================ + +:Date: November, 2018 +:Author: Daniel Jordan + + +Introduction +============ + +ktask is a generic framework for parallelizing CPU-intensive work in the +kernel. The intended use is for big machines that can use their CPU power to +speed up large tasks that can't otherwise be multithreaded in userland. The +API is generic enough to add concurrency to many different kinds of tasks--for +example, page clearing over an address range or freeing a list of pages--and +aims to save its clients the trouble of splitting up the work, choosing the +number of helper threads to use, maintaining an efficient concurrency level, +starting these threads, and load balancing the work between them. + + +Motivation +========== + +A single CPU can spend an excessive amount of time in the kernel operating on +large amounts of data. Often these situations arise during initialization- and +destruction-related tasks, where the data involved scales with system size. +These long-running jobs can slow startup and shutdown of applications and the +system itself while extra CPUs sit idle. + +To ensure that applications and the kernel continue to perform well as core +counts and memory sizes increase, the kernel harnesses these idle CPUs to +complete such jobs more quickly. + +For example, when booting a large NUMA machine, ktask uses additional CPUs that +would otherwise be idle until the machine is fully up to avoid a needless +bottleneck during system boot and allow the kernel to take advantage of unused +memory bandwidth. 
Similarly, when starting a large VM using VFIO, ktask takes +advantage of the VM's idle CPUs during VFIO page pinning rather than have the +VM's boot blocked on one thread doing all the work. + +ktask is not a substitute for single-threaded optimization. However, there is +a point where a single CPU hits a wall despite performance tuning, so +parallelize! + + +Concept +======= + +ktask is built on unbound workqueues to take advantage of the thread management +facilities it provides: creation, destruction, flushing, priority setting, and +NUMA affinity. + +A little terminology up front: A 'task' is the total work there is to do and a +'chunk' is a unit of work given to a thread. + +To complete a task using the ktask framework, a client provides a thread +function that is responsible for completing one chunk. The thread function is +defined in a standard way, with start and end arguments that delimit the chunk +as well as an argument that the client uses to pass data specific to the task. + +In addition, the client supplies an object representing the start of the task +and an iterator function that knows how to advance some number of units in the +task to yield another object representing the new task position. The framework +uses the start object and iterator internally to divide the task into chunks. + +Finally, the client passes the total task size and a minimum chunk size to +indicate the minimum amount of work that's appropriate to do in one chunk. The +sizes are given in task-specific units (e.g. pages, inodes, bytes). The +framework uses these sizes, along with the number of online CPUs and an +internal maximum number of threads, to decide how many threads to start and how +many chunks to divide the task into. + +For example, consider the task of clearing a gigantic page. This used to be +done in a single thread with a for loop that calls a page clearing function for +each constituent base page. To parallelize with ktask, the client first moves +the for loop to the thread function, adapting it to operate on the range passed +to the function. In this simple case, the thread function's start and end +arguments are just addresses delimiting the portion of the gigantic page to +clear. Then, where the for loop used to be, the client calls into ktask with +the start address of the gigantic page, the total size of the gigantic page, +and the thread function. Internally, ktask will divide the address range into +an appropriate number of chunks and start an appropriate number of threads to +complete these chunks. + + +Configuration +============= + +To use ktask, configure the kernel with CONFIG_KTASK=y. + +If CONFIG_KTASK=n, calls to the ktask API are simply #define'd to run the +thread function that the client provides so that the task is completed without +concurrency in the current thread. + + +Interface +========= + +.. kernel-doc:: include/linux/ktask.h + + +Resource Limits +=============== + +ktask has resource limits on the number of work items it sends to workqueue. +In ktask, a workqueue item is a thread that runs chunks of the task until the +task is finished. + +These limits support the different ways ktask uses workqueues: + - ktask_run to run threads on the calling thread's node. + - ktask_run_numa to run threads on the node(s) specified. + - ktask_run_numa with nid=NUMA_NO_NODE to run threads on any node in the + system. 
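As a purely illustrative aside (not part of the patch), the three uses just
listed look roughly like this from a caller's point of view. The chunk
function zero_chunk() and the variables buf, len, and nid are invented for the
example; the ktask calls and constants are the ones defined later in this
series:

    #include <linux/ktask.h>
    #include <linux/string.h>
    #include <linux/numa.h>

    /* Invented chunk function: zero the byte range [start, end). */
    static int zero_chunk(void *start, void *end, void *arg)
    {
            memset(start, 0, (char *)end - (char *)start);
            return KTASK_RETURN_SUCCESS;
    }

    static int zero_buffer(void *buf, size_t len, int nid)
    {
            struct ktask_node node = { .kn_start = buf,
                                       .kn_task_size = len,
                                       .kn_nid = nid };
            /* Default iterator walks bytes; at least 128M per chunk. */
            DEFINE_KTASK_CTL(ctl, zero_chunk, NULL, KTASK_MEM_CHUNK);
            int err;

            /* 1) ktask_run: helper threads run on the calling thread's node. */
            err = ktask_run(buf, len, &ctl);

            /* 2) ktask_run_numa: helpers run on the node(s) in the array. */
            err = ktask_run_numa(&node, 1, &ctl);

            /* 3) With kn_nid = NUMA_NO_NODE, helpers may run on any node. */
            node.kn_nid = NUMA_NO_NODE;
            err = ktask_run_numa(&node, 1, &ctl);

            return err;
    }

A real caller would use just one of the three calls; they are shown together
only to mirror the list above.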
+ +To support these different ways of queueing work while maintaining an efficient +concurrency level, we need both system-wide and per-node limits on the number +of threads. Without per-node limits, a node might become oversubscribed +despite ktask staying within the system-wide limit, and without a system-wide +limit, we can't properly account for work that can run on any node. + +The system-wide limit is based on the total number of CPUs, and the per-node +limit on the CPU count for each node. A per-node work item counts against the +system-wide limit. Workqueue's max_active can't accommodate both types of +limit, no matter how many workqueues are used, so ktask implements its own. + +If a per-node limit is reached, the work item is allowed to run anywhere on the +machine to avoid overwhelming the node. If the global limit is also reached, +ktask won't queue additional work items until we fall below the limit again. + +These limits apply only to workqueue items--that is, helper threads beyond the +one starting the task. That way, one thread per task is always allowed to run. + + +Scheduler Interaction +===================== + +Even within the resource limits, ktask must take care to run a number of +threads appropriate for the system's current CPU load. Under high CPU usage, +starting excessive helper threads may disturb other tasks, unfairly taking CPU +time away from them for the sake of an optimized kernel code path. + +ktask plays nicely in this case by setting helper threads to the lowest +scheduling priority on the system (MAX_NICE). This way, helpers' CPU time is +appropriately throttled on a busy system and other tasks are not disturbed. + +The main thread initiating the task remains at its original priority so that it +still makes progress on a busy system. + +It is possible for a helper thread to start running and then be forced off-CPU +by a higher priority thread. With the helper's CPU time curtailed by MAX_NICE, +the main thread may wait longer for the task to finish than it would have had +it not started any helpers, so to ensure forward progress at a single-threaded +pace, once the main thread is finished with all outstanding work in the task, +the main thread wills its priority to one helper thread at a time. At least +one thread will then always be running at the priority of the calling thread. + + +Cgroup Awareness +================ + +Given the potentially large amount of CPU time ktask threads may consume, they +should be aware of the cgroup of the task that called into ktask and +appropriately throttled. + +TODO: Implement cgroup-awareness in unbound workqueues. + + +Power Management +================ + +Starting additional helper threads may cause the system to consume more energy, +which is undesirable on energy-conscious devices. Therefore ktask needs to be +aware of cpufreq policies and scaling governors. + +If an energy-conscious policy is in use (e.g. powersave, conservative) on any +part of the system, that is a signal that the user has strong power management +preferences, in which case ktask is disabled. + +TODO: Implement this. + + +Backward Compatibility +====================== + +ktask is written so that existing calls to the API will be backwards compatible +should the API gain new features in the future. This is accomplished by +restricting API changes to members of struct ktask_ctl and having clients make +an opaque initialization call (DEFINE_KTASK_CTL). 
This initialization can then be modified to include any new arguments so that
existing call sites stay the same.
+
+
+Error Handling
+==============
+
+Calls to ktask fail only if the provided thread function fails. In particular,
+ktask avoids allocating memory internally during a task, so it's safe to use in
+sensitive contexts.
+
+Tasks can fail midway through their work. To recover, the finished chunks of
+work need to be undone in a task-specific way, so ktask allows clients to pass
+an "undo" callback that is responsible for undoing one chunk of work. To avoid
+multiple levels of error handling, this "undo" callback should not be allowed
+to fail. For simplicity and because it's a slow path, undoing is not
+multithreaded.
+
+Each call to ktask_run and ktask_run_numa returns a single value,
+KTASK_RETURN_SUCCESS or a client-specific value. Since threads can fail for
+different reasons, however, ktask may need the ability to return
+thread-specific error information. This can be added later if needed.

From patchwork Mon Nov 5 16:55:47 2018
X-Patchwork-Submitter: Daniel Jordan
X-Patchwork-Id: 10668707
From: Daniel Jordan
To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz
Subject: [RFC PATCH v4 02/13] ktask: multithread CPU-intensive kernel work
Date: Mon, 5 Nov 2018 11:55:47 -0500
Message-Id: <20181105165558.11698-3-daniel.m.jordan@oracle.com>
In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com>
References: <20181105165558.11698-1-daniel.m.jordan@oracle.com>

A single CPU can spend an excessive amount of time in the kernel operating on
large amounts of data. Often these situations arise during initialization- and
destruction-related tasks, where the data involved scales with system size.
These long-running jobs can slow startup and shutdown of applications and the
system itself while extra CPUs sit idle.

To ensure that applications and the kernel continue to perform well as core
counts and memory sizes increase, harness these idle CPUs to complete such jobs
more quickly.

ktask is a generic framework for parallelizing CPU-intensive work in the
kernel. The API is generic enough to add concurrency to many different kinds
of tasks--for example, zeroing a range of pages or evicting a list of
inodes--and aims to save its clients the trouble of splitting up the work,
choosing the number of threads to use, maintaining an efficient concurrency
level, starting these threads, and load balancing the work between them.

The Documentation patch earlier in this series, from which the above was
swiped, has more background.

Inspired by work from Pavel Tatashin, Steve Sistare, and Jonathan Adams.
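As a quick illustration of the interface this patch adds (not part of the
patch itself), the control structure also carries optional knobs that a caller
can set after DEFINE_KTASK_CTL. The names page_clear_chunk() and pfn_iter()
below are invented for the sketch:

    #include <linux/ktask.h>

    /* Hypothetical chunk and iterator functions, declared only for the sketch. */
    static int page_clear_chunk(void *start, void *end, void *arg);
    static void *pfn_iter(void *position, size_t size);

    static void example_ctl_setup(void)
    {
            /* At least KTASK_PTE_MINCHUNK task units (pages) per chunk. */
            DEFINE_KTASK_CTL(ctl, page_clear_chunk, NULL, KTASK_PTE_MINCHUNK);

            /* Optionally cap this task at 8 threads instead of the default. */
            ktask_ctl_set_max_threads(&ctl, 8);

            /* Optionally walk the task in page units instead of a byte range. */
            ktask_ctl_set_iter_func(&ctl, pfn_iter);

            /* ... then pass &ctl to ktask_run() or ktask_run_numa() ... */
    }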
Signed-off-by: Daniel Jordan Suggested-by: Pavel Tatashin Suggested-by: Steve Sistare Suggested-by: Jonathan Adams --- include/linux/ktask.h | 237 +++++++++++++++++++ init/Kconfig | 11 + init/main.c | 2 + kernel/Makefile | 2 +- kernel/ktask.c | 526 ++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 777 insertions(+), 1 deletion(-) create mode 100644 include/linux/ktask.h create mode 100644 kernel/ktask.c diff --git a/include/linux/ktask.h b/include/linux/ktask.h new file mode 100644 index 000000000000..9c75a93b51b9 --- /dev/null +++ b/include/linux/ktask.h @@ -0,0 +1,237 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * ktask.h - framework to parallelize CPU-intensive kernel work + * + * For more information, see Documentation/core-api/ktask.rst. + * + * Copyright (c) 2018 Oracle Corporation + * Author: Daniel Jordan + */ +#ifndef _LINUX_KTASK_H +#define _LINUX_KTASK_H + +#include +#include + +#define KTASK_RETURN_SUCCESS 0 + +/** + * struct ktask_node - Holds per-NUMA-node information about a task. + * + * @kn_start: An object that describes the start of the task on this NUMA node. + * @kn_task_size: size of this node's work (units are task-specific) + * @kn_nid: NUMA node id to run threads on + */ +struct ktask_node { + void *kn_start; + size_t kn_task_size; + int kn_nid; +}; + +/** + * typedef ktask_thread_func + * + * Called on each chunk of work that a ktask thread does. A thread may call + * this multiple times during one task. + * + * @start: An object that describes the start of the chunk. + * @end: An object that describes the end of the chunk. + * @arg: The thread function argument (provided with struct ktask_ctl). + * + * RETURNS: + * KTASK_RETURN_SUCCESS or a client-specific nonzero error code. + */ +typedef int (*ktask_thread_func)(void *start, void *end, void *arg); + +/** + * typedef ktask_iter_func + * + * An iterator function that advances the position by size units. + * + * @position: An object that describes the current position in the task. + * @size: The amount to advance in the task (in task-specific units). + * + * RETURNS: + * An object representing the new position. + */ +typedef void *(*ktask_iter_func)(void *position, size_t size); + +/** + * ktask_iter_range + * + * An iterator function for a contiguous range such as an array or address + * range. This is the default iterator; clients may override with + * ktask_ctl_set_iter_func. + * + * @position: An object that describes the current position in the task. + * Interpreted as an unsigned long. + * @size: The amount to advance in the task (in task-specific units). + * + * RETURNS: + * (position + size) + */ +void *ktask_iter_range(void *position, size_t size); + +/** + * struct ktask_ctl - Client-provided per-task control information. + * + * @kc_thread_func: A thread function that completes one chunk of the task per + * call. + * @kc_func_arg: An argument to be passed to the thread and undo functions. + * @kc_iter_func: An iterator function to advance the iterator by some number + * of task-specific units. + * @kc_min_chunk_size: The minimum chunk size in task-specific units. This + * allows the client to communicate the minimum amount of + * work that's appropriate for one worker thread to do at + * once. + * @kc_max_threads: max threads to use for the task, actual number may be less + * depending on CPU count, task size, and minimum chunk size. + */ +struct ktask_ctl { + /* Required arguments set with DEFINE_KTASK_CTL. 
*/ + ktask_thread_func kc_thread_func; + void *kc_func_arg; + size_t kc_min_chunk_size; + + /* Optional, can set with ktask_ctl_set_*. Defaults on the right. */ + ktask_iter_func kc_iter_func; /* ktask_iter_range */ + size_t kc_max_threads; /* 0 (uses internal limit) */ +}; + +#define KTASK_CTL_INITIALIZER(thread_func, func_arg, min_chunk_size) \ + { \ + .kc_thread_func = (ktask_thread_func)(thread_func), \ + .kc_func_arg = (func_arg), \ + .kc_min_chunk_size = (min_chunk_size), \ + .kc_iter_func = (ktask_iter_range), \ + .kc_max_threads = 0, \ + } + +/* + * KTASK_CTL_INITIALIZER casts 'thread_func' to be of type ktask_thread_func so + * clients can write cleaner thread functions by relieving them of the need to + * cast the three void * arguments. Clients can just use the actual argument + * types instead. + */ +#define DEFINE_KTASK_CTL(ctl_name, thread_func, func_arg, min_chunk_size) \ + struct ktask_ctl ctl_name = \ + KTASK_CTL_INITIALIZER(thread_func, func_arg, min_chunk_size) \ + +/** + * ktask_ctl_set_iter_func - Set a task-specific iterator + * + * Overrides the default iterator, ktask_iter_range. + * + * Casts the type of the iterator function so its arguments can be + * client-specific (see the comment above DEFINE_KTASK_CTL). + * + * @ctl: A control structure containing information about the task. + * @iter_func: Walks a given number of units forward in the task, returning + * an iterator corresponding to the new position. + */ +#define ktask_ctl_set_iter_func(ctl, iter_func) \ + ((ctl)->kc_iter_func = (ktask_iter_func)(iter_func)) + +/** + * ktask_ctl_set_max_threads - Set a task-specific maximum number of threads + * + * This overrides the default maximum, which is KTASK_DEFAULT_MAX_THREADS. + * + * @ctl: A control structure containing information about the task. + * @max_threads: The maximum number of threads to be started for this task. + * The actual number of threads may be less than this. + */ +static inline void ktask_ctl_set_max_threads(struct ktask_ctl *ctl, + size_t max_threads) +{ + ctl->kc_max_threads = max_threads; +} + +/* + * The minimum chunk sizes for tasks that operate on ranges of memory. For + * now, say 128M. + */ +#define KTASK_MEM_CHUNK (1ul << 27) +#define KTASK_PTE_MINCHUNK (KTASK_MEM_CHUNK / PAGE_SIZE) +#define KTASK_PMD_MINCHUNK (KTASK_MEM_CHUNK / PMD_SIZE) + +#ifdef CONFIG_KTASK + +/** + * ktask_run - Runs one task. + * + * Starts threads to complete one task with the given thread function. Waits + * for the task to finish before returning. + * + * On a NUMA system, threads run on the current node. This is designed to + * mirror other parts of the kernel that favor locality, such as the default + * memory policy of allocating pages from the same node as the calling thread. + * ktask_run_numa may be used to get more control over where threads run. + * + * @start: An object that describes the start of the task. The client thread + * function interprets the object however it sees fit (e.g. an array + * index, a simple pointer, or a pointer to a more complicated + * representation of job position). + * @task_size: The size of the task (units are task-specific). + * @ctl: A control structure containing information about the task, including + * the client thread function. + * + * RETURNS: + * KTASK_RETURN_SUCCESS or a client-specific nonzero error code. + */ +int ktask_run(void *start, size_t task_size, struct ktask_ctl *ctl); + +/** + * ktask_run_numa - Runs one task while accounting for NUMA locality. 
+ * + * Starts threads on the requested nodes to complete one task with the given + * thread function. The client is responsible for organizing the work along + * NUMA boundaries in the 'nodes' array. Waits for the task to finish before + * returning. + * + * In the special case of NUMA_NO_NODE, threads are allowed to run on any node. + * This is distinct from ktask_run, which runs threads on the current node. + * + * @nodes: An array of nodes. + * @nr_nodes: Length of the 'nodes' array. + * @ctl: Control structure containing information about the task. + * + * RETURNS: + * KTASK_RETURN_SUCCESS or a client-specific nonzero error code. + */ +int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes, + struct ktask_ctl *ctl); + +void ktask_init(void); + +#else /* CONFIG_KTASK */ + +static inline int ktask_run(void *start, size_t task_size, + struct ktask_ctl *ctl) +{ + return ctl->kc_thread_func(start, ctl->kc_iter_func(start, task_size), + ctl->kc_func_arg); +} + +static inline int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes, + struct ktask_ctl *ctl) +{ + size_t i; + int err = KTASK_RETURN_SUCCESS; + + for (i = 0; i < nr_nodes; ++i) { + void *start = nodes[i].kn_start; + void *end = ctl->kc_iter_func(start, nodes[i].kn_task_size); + + err = ctl->kc_thread_func(start, end, ctl->kc_func_arg); + if (err != KTASK_RETURN_SUCCESS) + break; + } + + return err; +} + +static inline void ktask_init(void) { } + +#endif /* CONFIG_KTASK */ +#endif /* _LINUX_KTASK_H */ diff --git a/init/Kconfig b/init/Kconfig index 41583f468cb4..ed82f76ed0b7 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -346,6 +346,17 @@ config AUDIT_TREE depends on AUDITSYSCALL select FSNOTIFY +config KTASK + bool "Multithread CPU-intensive kernel work" + depends on SMP + default y + help + Parallelize CPU-intensive kernel work. This feature is designed for + big machines that can take advantage of their extra CPUs to speed up + large kernel tasks. When enabled, kworker threads may occupy more + CPU time during these kernel tasks, but these threads are throttled + when other tasks on the system need CPU time. + source "kernel/irq/Kconfig" source "kernel/time/Kconfig" source "kernel/Kconfig.preempt" diff --git a/init/main.c b/init/main.c index ee147103ba1b..c689f00eab95 100644 --- a/init/main.c +++ b/init/main.c @@ -92,6 +92,7 @@ #include #include #include +#include #include #include @@ -1145,6 +1146,7 @@ static noinline void __init kernel_init_freeable(void) smp_init(); sched_init_smp(); + ktask_init(); page_alloc_init_late(); diff --git a/kernel/Makefile b/kernel/Makefile index 63b643eb7e70..ce238cb7add5 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -10,7 +10,7 @@ obj-y = fork.o exec_domain.o panic.o \ extable.o params.o \ kthread.o sys_ni.o nsproxy.o \ notifier.o ksysfs.o cred.o reboot.o \ - async.o range.o smpboot.o ucount.o + async.o range.o smpboot.o ucount.o ktask.o obj-$(CONFIG_MODULES) += kmod.o obj-$(CONFIG_MULTIUSER) += groups.o diff --git a/kernel/ktask.c b/kernel/ktask.c new file mode 100644 index 000000000000..a7b2b5a62737 --- /dev/null +++ b/kernel/ktask.c @@ -0,0 +1,526 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* + * ktask.c - framework to parallelize CPU-intensive kernel work + * + * For more information, see Documentation/core-api/ktask.rst. 
+ * + * Copyright (c) 2018 Oracle Corporation + * Author: Daniel Jordan + */ + +#define pr_fmt(fmt) "ktask: " fmt + +#include + +#ifdef CONFIG_KTASK + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* Resource limits on the amount of workqueue items queued through ktask. */ +static DEFINE_SPINLOCK(ktask_rlim_lock); +/* Work items queued on all nodes (includes NUMA_NO_NODE) */ +static size_t ktask_rlim_cur; +static size_t ktask_rlim_max; +/* Work items queued per node */ +static size_t *ktask_rlim_node_cur; +static size_t *ktask_rlim_node_max; + +/* Allow only 80% of the cpus to be running additional ktask threads. */ +#define KTASK_CPUFRAC_NUMER 4 +#define KTASK_CPUFRAC_DENOM 5 + +/* Used to pass ktask data to the workqueue API. */ +struct ktask_work { + struct work_struct kw_work; + struct ktask_task *kw_task; + int kw_ktask_node_i; + int kw_queue_nid; + struct list_head kw_list; /* ktask_free_works linkage */ +}; + +static LIST_HEAD(ktask_free_works); +static struct ktask_work *ktask_works; + +/* Represents one task. This is for internal use only. */ +struct ktask_task { + struct ktask_ctl kt_ctl; + size_t kt_total_size; + size_t kt_chunk_size; + /* protects this struct and struct ktask_work's of a running task */ + struct mutex kt_mutex; + struct ktask_node *kt_nodes; + size_t kt_nr_nodes; + size_t kt_nr_nodes_left; + size_t kt_nworks; + size_t kt_nworks_fini; + int kt_error; /* first error from thread_func */ + struct completion kt_ktask_done; +}; + +/* + * Shrink the size of each job by this shift amount to load balance between the + * worker threads. + */ +#define KTASK_LOAD_BAL_SHIFT 2 + +#define KTASK_DEFAULT_MAX_THREADS 4 + +/* Maximum number of threads for a single task. */ +int ktask_max_threads = KTASK_DEFAULT_MAX_THREADS; + +static struct workqueue_struct *ktask_wq; +static struct workqueue_struct *ktask_nonuma_wq; + +static void ktask_thread(struct work_struct *work); + +static void ktask_init_work(struct ktask_work *kw, struct ktask_task *kt, + size_t ktask_node_i, size_t queue_nid) +{ + INIT_WORK(&kw->kw_work, ktask_thread); + kw->kw_task = kt; + kw->kw_ktask_node_i = ktask_node_i; + kw->kw_queue_nid = queue_nid; +} + +static void ktask_queue_work(struct ktask_work *kw) +{ + struct workqueue_struct *wq; + int cpu; + + if (kw->kw_queue_nid == NUMA_NO_NODE) { + /* + * If no node is specified, use ktask_nonuma_wq to + * allow the thread to run on any node, but fall back + * to ktask_wq if we couldn't allocate ktask_nonuma_wq. + */ + cpu = WORK_CPU_UNBOUND; + wq = (ktask_nonuma_wq) ?: ktask_wq; + } else { + /* + * WQ_UNBOUND workqueues, such as the one ktask uses, + * execute work on some CPU from the node of the CPU we + * pass to queue_work_on, so just pick any CPU to stand + * for the node on NUMA systems. + * + * On non-NUMA systems, cpumask_of_node becomes + * cpu_online_mask. + */ + cpu = cpumask_any(cpumask_of_node(kw->kw_queue_nid)); + wq = ktask_wq; + } + + WARN_ON(!queue_work_on(cpu, wq, &kw->kw_work)); +} + +/* Returns true if we're migrating this part of the task to another node. */ +static bool ktask_node_migrate(struct ktask_node *old_kn, struct ktask_node *kn, + size_t ktask_node_i, struct ktask_work *kw, + struct ktask_task *kt) +{ + int new_queue_nid; + + /* + * Don't migrate a user thread, otherwise migrate only if we're going + * to a different node. 
+ */ + if (!IS_ENABLED(CONFIG_NUMA) || !(current->flags & PF_KTHREAD) || + kn->kn_nid == old_kn->kn_nid || num_online_nodes() == 1) + return false; + + /* Adjust resource limits. */ + spin_lock(&ktask_rlim_lock); + if (kw->kw_queue_nid != NUMA_NO_NODE) + --ktask_rlim_node_cur[kw->kw_queue_nid]; + + if (kn->kn_nid != NUMA_NO_NODE && + ktask_rlim_node_cur[kw->kw_queue_nid] < + ktask_rlim_node_max[kw->kw_queue_nid]) { + new_queue_nid = kn->kn_nid; + ++ktask_rlim_node_cur[new_queue_nid]; + } else { + new_queue_nid = NUMA_NO_NODE; + } + spin_unlock(&ktask_rlim_lock); + + ktask_init_work(kw, kt, ktask_node_i, new_queue_nid); + ktask_queue_work(kw); + + return true; +} + +static void ktask_thread(struct work_struct *work) +{ + struct ktask_work *kw = container_of(work, struct ktask_work, kw_work); + struct ktask_task *kt = kw->kw_task; + struct ktask_ctl *kc = &kt->kt_ctl; + struct ktask_node *kn = &kt->kt_nodes[kw->kw_ktask_node_i]; + bool done; + + mutex_lock(&kt->kt_mutex); + + while (kt->kt_total_size > 0 && kt->kt_error == KTASK_RETURN_SUCCESS) { + void *start, *end; + size_t size; + int ret; + + if (kn->kn_task_size == 0) { + /* The current node is out of work; pick a new one. */ + size_t remaining_nodes_seen = 0; + size_t new_idx = prandom_u32_max(kt->kt_nr_nodes_left); + struct ktask_node *old_kn; + size_t i; + + WARN_ON(kt->kt_nr_nodes_left == 0); + WARN_ON(new_idx >= kt->kt_nr_nodes_left); + for (i = 0; i < kt->kt_nr_nodes; ++i) { + if (kt->kt_nodes[i].kn_task_size == 0) + continue; + + if (remaining_nodes_seen >= new_idx) + break; + + ++remaining_nodes_seen; + } + /* We should have found work on another node. */ + WARN_ON(i >= kt->kt_nr_nodes); + + old_kn = kn; + kn = &kt->kt_nodes[i]; + + /* Start another worker on the node we've chosen. */ + if (ktask_node_migrate(old_kn, kn, i, kw, kt)) { + mutex_unlock(&kt->kt_mutex); + return; + } + } + + start = kn->kn_start; + size = min(kt->kt_chunk_size, kn->kn_task_size); + end = kc->kc_iter_func(start, size); + kn->kn_start = end; + kn->kn_task_size -= size; + WARN_ON(kt->kt_total_size < size); + kt->kt_total_size -= size; + if (kn->kn_task_size == 0) { + WARN_ON(kt->kt_nr_nodes_left == 0); + kt->kt_nr_nodes_left--; + } + + mutex_unlock(&kt->kt_mutex); + + ret = kc->kc_thread_func(start, end, kc->kc_func_arg); + + mutex_lock(&kt->kt_mutex); + + /* Save first error code only. */ + if (kt->kt_error == KTASK_RETURN_SUCCESS && ret != kt->kt_error) + kt->kt_error = ret; + } + + WARN_ON(kt->kt_nr_nodes_left > 0 && + kt->kt_error == KTASK_RETURN_SUCCESS); + + ++kt->kt_nworks_fini; + WARN_ON(kt->kt_nworks_fini > kt->kt_nworks); + done = (kt->kt_nworks_fini == kt->kt_nworks); + mutex_unlock(&kt->kt_mutex); + + if (done) + complete(&kt->kt_ktask_done); +} + +/* + * Returns the number of chunks to break this task into. + * + * The number of chunks will be at least the number of works, but in the common + * case of a large task, the number of chunks will be greater to load balance + * between the workqueue threads in case some of them finish more quickly than + * others. + */ +static size_t ktask_chunk_size(size_t task_size, size_t min_chunk_size, + size_t nworks) +{ + size_t chunk_size; + + if (nworks == 1) + return task_size; + + chunk_size = (task_size / nworks) >> KTASK_LOAD_BAL_SHIFT; + + /* + * chunk_size should be a multiple of min_chunk_size for tasks that + * need to operate in fixed-size batches. 
+ */ + if (chunk_size > min_chunk_size) + chunk_size = rounddown(chunk_size, min_chunk_size); + + return max(chunk_size, min_chunk_size); +} + +/* + * Returns the number of works to be used in the task. This number includes + * the current thread, so a return value of 1 means no extra threads are + * started. + */ +static size_t ktask_init_works(struct ktask_node *nodes, size_t nr_nodes, + struct ktask_task *kt, + struct list_head *works_list) +{ + size_t i, nr_works, nr_works_check; + size_t min_chunk_size = kt->kt_ctl.kc_min_chunk_size; + size_t max_threads = kt->kt_ctl.kc_max_threads; + + if (!ktask_wq) + return 1; + + if (max_threads == 0) + max_threads = ktask_max_threads; + + /* Ensure at least one thread when task_size < min_chunk_size. */ + nr_works_check = DIV_ROUND_UP(kt->kt_total_size, min_chunk_size); + nr_works_check = min_t(size_t, nr_works_check, num_online_cpus()); + nr_works_check = min_t(size_t, nr_works_check, max_threads); + + /* + * Use at least the current thread for this task; check whether + * ktask_rlim allows additional work items to be queued. + */ + nr_works = 1; + spin_lock(&ktask_rlim_lock); + for (i = nr_works; i < nr_works_check; ++i) { + /* Allocate works evenly over the task's given nodes. */ + size_t ktask_node_i = i % nr_nodes; + struct ktask_node *kn = &nodes[ktask_node_i]; + struct ktask_work *kw; + int nid = kn->kn_nid; + int queue_nid; + + WARN_ON(ktask_rlim_cur > ktask_rlim_max); + if (ktask_rlim_cur == ktask_rlim_max) + break; /* No more work items allowed to be queued. */ + + /* Allowed to queue on requested node? */ + if (nid != NUMA_NO_NODE && + ktask_rlim_node_cur[nid] < ktask_rlim_node_max[nid]) { + WARN_ON(ktask_rlim_node_cur[nid] > ktask_rlim_cur); + ++ktask_rlim_node_cur[nid]; + queue_nid = nid; + } else { + queue_nid = NUMA_NO_NODE; + } + + WARN_ON(list_empty(&ktask_free_works)); + kw = list_first_entry(&ktask_free_works, struct ktask_work, + kw_list); + list_move_tail(&kw->kw_list, works_list); + ktask_init_work(kw, kt, ktask_node_i, queue_nid); + + ++ktask_rlim_cur; + ++nr_works; + } + spin_unlock(&ktask_rlim_lock); + + return nr_works; +} + +static void ktask_fini_works(struct ktask_task *kt, + struct list_head *works_list) +{ + struct ktask_work *work; + + spin_lock(&ktask_rlim_lock); + + /* Put the works back on the free list, adjusting rlimits. 
*/ + list_for_each_entry(work, works_list, kw_list) { + if (work->kw_queue_nid != NUMA_NO_NODE) { + WARN_ON(ktask_rlim_node_cur[work->kw_queue_nid] == 0); + --ktask_rlim_node_cur[work->kw_queue_nid]; + } + WARN_ON(ktask_rlim_cur == 0); + --ktask_rlim_cur; + } + list_splice(works_list, &ktask_free_works); + + spin_unlock(&ktask_rlim_lock); +} + +int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes, + struct ktask_ctl *ctl) +{ + size_t i; + struct ktask_work kw; + struct ktask_work *work; + LIST_HEAD(works_list); + struct ktask_task kt = { + .kt_ctl = *ctl, + .kt_total_size = 0, + .kt_nodes = nodes, + .kt_nr_nodes = nr_nodes, + .kt_nr_nodes_left = nr_nodes, + .kt_nworks_fini = 0, + .kt_error = KTASK_RETURN_SUCCESS, + }; + + for (i = 0; i < nr_nodes; ++i) { + kt.kt_total_size += nodes[i].kn_task_size; + if (nodes[i].kn_task_size == 0) + kt.kt_nr_nodes_left--; + + WARN_ON(nodes[i].kn_nid >= MAX_NUMNODES); + } + + if (kt.kt_total_size == 0) + return KTASK_RETURN_SUCCESS; + + mutex_init(&kt.kt_mutex); + init_completion(&kt.kt_ktask_done); + + kt.kt_nworks = ktask_init_works(nodes, nr_nodes, &kt, &works_list); + kt.kt_chunk_size = ktask_chunk_size(kt.kt_total_size, + ctl->kc_min_chunk_size, + kt.kt_nworks); + + list_for_each_entry(work, &works_list, kw_list) + ktask_queue_work(work); + + /* Use the current thread, which saves starting a workqueue worker. */ + ktask_init_work(&kw, &kt, 0, nodes[0].kn_nid); + ktask_thread(&kw.kw_work); + + /* Wait for all the jobs to finish. */ + wait_for_completion(&kt.kt_ktask_done); + + ktask_fini_works(&kt, &works_list); + mutex_destroy(&kt.kt_mutex); + + return kt.kt_error; +} +EXPORT_SYMBOL_GPL(ktask_run_numa); + +int ktask_run(void *start, size_t task_size, struct ktask_ctl *ctl) +{ + struct ktask_node node; + + node.kn_start = start; + node.kn_task_size = task_size; + node.kn_nid = numa_node_id(); + + return ktask_run_numa(&node, 1, ctl); +} +EXPORT_SYMBOL_GPL(ktask_run); + +/* + * Initialize internal limits on work items queued. Work items submitted to + * cmwq capped at 80% of online cpus both system-wide and per-node to maintain + * an efficient level of parallelization at these respective levels. + */ +static bool __init ktask_rlim_init(void) +{ + int node, nr_cpus; + unsigned int nr_node_cpus; + + nr_cpus = num_online_cpus(); + + /* XXX Handle CPU hotplug. */ + if (nr_cpus == 1) + return false; + + ktask_rlim_node_cur = kcalloc(num_possible_nodes(), sizeof(size_t), + GFP_KERNEL); + + ktask_rlim_node_max = kmalloc_array(num_possible_nodes(), + sizeof(size_t), GFP_KERNEL); + + ktask_rlim_max = mult_frac(nr_cpus, KTASK_CPUFRAC_NUMER, + KTASK_CPUFRAC_DENOM); + for_each_node(node) { + nr_node_cpus = cpumask_weight(cpumask_of_node(node)); + ktask_rlim_node_max[node] = mult_frac(nr_node_cpus, + KTASK_CPUFRAC_NUMER, + KTASK_CPUFRAC_DENOM); + } + + return true; +} + +void __init ktask_init(void) +{ + struct workqueue_attrs *attrs; + int i, ret; + + if (!ktask_rlim_init()) + goto out; + + ktask_works = kmalloc_array(ktask_rlim_max, sizeof(struct ktask_work), + GFP_KERNEL); + for (i = 0; i < ktask_rlim_max; ++i) + list_add_tail(&ktask_works[i].kw_list, &ktask_free_works); + + ktask_wq = alloc_workqueue("ktask_wq", WQ_UNBOUND, 0); + if (!ktask_wq) { + pr_warn("disabled (failed to alloc ktask_wq)"); + goto out; + } + + /* + * Threads executing work from this workqueue can run on any node on + * the system. If we get any failures below, use ktask_wq in its + * place. It's better than nothing. 
+ */ + ktask_nonuma_wq = alloc_workqueue("ktask_nonuma_wq", WQ_UNBOUND, 0); + if (!ktask_nonuma_wq) { + pr_warn("disabled (failed to alloc ktask_nonuma_wq)"); + goto alloc_fail; + } + + attrs = alloc_workqueue_attrs(GFP_KERNEL); + if (!attrs) { + pr_warn("disabled (couldn't alloc wq attrs)"); + goto alloc_fail; + } + + attrs->no_numa = true; + + ret = apply_workqueue_attrs(ktask_nonuma_wq, attrs); + if (ret != 0) { + pr_warn("disabled (couldn't apply attrs to ktask_nonuma_wq)"); + goto apply_fail; + } + + free_workqueue_attrs(attrs); +out: + return; + +apply_fail: + free_workqueue_attrs(attrs); +alloc_fail: + if (ktask_wq) + destroy_workqueue(ktask_wq); + if (ktask_nonuma_wq) + destroy_workqueue(ktask_nonuma_wq); + ktask_wq = NULL; + ktask_nonuma_wq = NULL; +} + +#endif /* CONFIG_KTASK */ + +/* + * This function is defined outside CONFIG_KTASK so it can be called in the + * !CONFIG_KTASK versions of ktask_run and ktask_run_numa. + */ +void *ktask_iter_range(void *position, size_t size) +{ + return (char *)position + size; +} +EXPORT_SYMBOL_GPL(ktask_iter_range); From patchwork Mon Nov 5 16:55:48 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Daniel Jordan X-Patchwork-Id: 10668737 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 625EC14BD for ; Mon, 5 Nov 2018 16:59:44 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4FDC929B8D for ; Mon, 5 Nov 2018 16:59:44 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 43C9829BA0; Mon, 5 Nov 2018 16:59:44 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 916A329B8D for ; Mon, 5 Nov 2018 16:59:43 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387625AbeKFCRX (ORCPT ); Mon, 5 Nov 2018 21:17:23 -0500 Received: from aserp2120.oracle.com ([141.146.126.78]:49520 "EHLO aserp2120.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730073AbeKFCRW (ORCPT ); Mon, 5 Nov 2018 21:17:22 -0500 Received: from pps.filterd (aserp2120.oracle.com [127.0.0.1]) by aserp2120.oracle.com (8.16.0.22/8.16.0.22) with SMTP id wA5Gs2dF104430; Mon, 5 Nov 2018 16:56:12 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=corp-2018-07-02; bh=r9EK74CGq12tLNaMapE/4PyK+rMYKf32lRC9GP5eoB4=; b=2BYNYTmHndcc9AtITIKvpRSM1QBsihQ9IWPChTb7cyKJB0Thb+tB3ma2zfr25u3qFQd+ kOBpr6MHXfz2ptLnvploitatlkhZ/jQ8eMmCoNXezIzt59/2z6LaLsmHCMVAOQBRnp+d Q+/90S3DN1WT5g6uYqKPpQkpUqJNaV5YuJE6bpt/RgyRPIveZtEV9eLIaCTyIvorkLOZ EpKur+8+QGDme/kOApS6CmjqCIUqerYjBtAdOXJSTlNsPZcB83O6U+qSs8/+GNI1mgDL nTO+EtjbuNHKPEuU8Uj2QKCw3owstK3YF/bvGbBcgdC0M8+rAPK6KA5fdXbpz/VrXa7k BA== Received: from aserv0021.oracle.com (aserv0021.oracle.com [141.146.126.233]) by aserp2120.oracle.com with ESMTP id 2nh3mpg584-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 
bits=256 verify=OK); Mon, 05 Nov 2018 16:56:12 +0000 Received: from userv0121.oracle.com (userv0121.oracle.com [156.151.31.72]) by aserv0021.oracle.com (8.14.4/8.14.4) with ESMTP id wA5GuB96024422 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 5 Nov 2018 16:56:11 GMT Received: from abhmp0006.oracle.com (abhmp0006.oracle.com [141.146.116.12]) by userv0121.oracle.com (8.14.4/8.13.8) with ESMTP id wA5GuARw008323; Mon, 5 Nov 2018 16:56:10 GMT Received: from localhost.localdomain (/73.60.114.248) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Mon, 05 Nov 2018 08:56:10 -0800 From: Daniel Jordan To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz Subject: [RFC PATCH v4 03/13] ktask: add undo support Date: Mon, 5 Nov 2018 11:55:48 -0500 Message-Id: <20181105165558.11698-4-daniel.m.jordan@oracle.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com> References: <20181105165558.11698-1-daniel.m.jordan@oracle.com> MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9068 signatures=668683 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=2 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1807170000 definitions=main-1811050153 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Tasks can fail midway through their work. To recover, the finished chunks of work need to be undone in a task-specific way. Allow ktask clients to pass an "undo" callback that is responsible for undoing one chunk of work. To avoid multiple levels of error handling, do not allow the callback to fail. For simplicity and because it's a slow path, undoing is not multithreaded. Signed-off-by: Daniel Jordan --- include/linux/ktask.h | 36 +++++++++++- kernel/ktask.c | 125 +++++++++++++++++++++++++++++++++++------- 2 files changed, 138 insertions(+), 23 deletions(-) diff --git a/include/linux/ktask.h b/include/linux/ktask.h index 9c75a93b51b9..30a6a88e5dad 100644 --- a/include/linux/ktask.h +++ b/include/linux/ktask.h @@ -10,6 +10,7 @@ #ifndef _LINUX_KTASK_H #define _LINUX_KTASK_H +#include #include #include @@ -23,9 +24,14 @@ * @kn_nid: NUMA node id to run threads on */ struct ktask_node { - void *kn_start; - size_t kn_task_size; - int kn_nid; + void *kn_start; + size_t kn_task_size; + int kn_nid; + + /* Private fields below - do not touch these. */ + void *kn_position; + size_t kn_remaining_size; + struct list_head kn_failed_works; }; /** @@ -43,6 +49,14 @@ struct ktask_node { */ typedef int (*ktask_thread_func)(void *start, void *end, void *arg); +/** + * typedef ktask_undo_func + * + * The same as ktask_thread_func, with the exception that it must always + * succeed, so it doesn't return anything. 
+ */ +typedef void (*ktask_undo_func)(void *start, void *end, void *arg); + /** * typedef ktask_iter_func * @@ -77,6 +91,11 @@ void *ktask_iter_range(void *position, size_t size); * * @kc_thread_func: A thread function that completes one chunk of the task per * call. + * @kc_undo_func: A function that undoes one chunk of the task per call. + * If non-NULL and error(s) occur during the task, this is + * called on all successfully completed chunks of work. The + * chunk(s) in which failure occurs should be handled in + * kc_thread_func. * @kc_func_arg: An argument to be passed to the thread and undo functions. * @kc_iter_func: An iterator function to advance the iterator by some number * of task-specific units. @@ -90,6 +109,7 @@ void *ktask_iter_range(void *position, size_t size); struct ktask_ctl { /* Required arguments set with DEFINE_KTASK_CTL. */ ktask_thread_func kc_thread_func; + ktask_undo_func kc_undo_func; void *kc_func_arg; size_t kc_min_chunk_size; @@ -101,6 +121,7 @@ struct ktask_ctl { #define KTASK_CTL_INITIALIZER(thread_func, func_arg, min_chunk_size) \ { \ .kc_thread_func = (ktask_thread_func)(thread_func), \ + .kc_undo_func = NULL, \ .kc_func_arg = (func_arg), \ .kc_min_chunk_size = (min_chunk_size), \ .kc_iter_func = (ktask_iter_range), \ @@ -132,6 +153,15 @@ struct ktask_ctl { #define ktask_ctl_set_iter_func(ctl, iter_func) \ ((ctl)->kc_iter_func = (ktask_iter_func)(iter_func)) +/** + * ktask_ctl_set_undo_func - Designate an undo function to unwind from error + * + * @ctl: A control structure containing information about the task. + * @undo_func: Undoes a piece of the task. + */ +#define ktask_ctl_set_undo_func(ctl, undo_func) \ + ((ctl)->kc_undo_func = (ktask_undo_func)(undo_func)) + /** * ktask_ctl_set_max_threads - Set a task-specific maximum number of threads * diff --git a/kernel/ktask.c b/kernel/ktask.c index a7b2b5a62737..b91c62f14dcd 100644 --- a/kernel/ktask.c +++ b/kernel/ktask.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -46,7 +47,12 @@ struct ktask_work { struct ktask_task *kw_task; int kw_ktask_node_i; int kw_queue_nid; - struct list_head kw_list; /* ktask_free_works linkage */ + /* task units from kn_start to kw_error_start */ + size_t kw_error_offset; + void *kw_error_start; + void *kw_error_end; + /* ktask_free_works, kn_failed_works linkage */ + struct list_head kw_list; }; static LIST_HEAD(ktask_free_works); @@ -170,11 +176,11 @@ static void ktask_thread(struct work_struct *work) mutex_lock(&kt->kt_mutex); while (kt->kt_total_size > 0 && kt->kt_error == KTASK_RETURN_SUCCESS) { - void *start, *end; - size_t size; + void *position, *end; + size_t size, position_offset; int ret; - if (kn->kn_task_size == 0) { + if (kn->kn_remaining_size == 0) { /* The current node is out of work; pick a new one. 
*/ size_t remaining_nodes_seen = 0; size_t new_idx = prandom_u32_max(kt->kt_nr_nodes_left); @@ -184,7 +190,7 @@ static void ktask_thread(struct work_struct *work) WARN_ON(kt->kt_nr_nodes_left == 0); WARN_ON(new_idx >= kt->kt_nr_nodes_left); for (i = 0; i < kt->kt_nr_nodes; ++i) { - if (kt->kt_nodes[i].kn_task_size == 0) + if (kt->kt_nodes[i].kn_remaining_size == 0) continue; if (remaining_nodes_seen >= new_idx) @@ -205,27 +211,40 @@ static void ktask_thread(struct work_struct *work) } } - start = kn->kn_start; - size = min(kt->kt_chunk_size, kn->kn_task_size); - end = kc->kc_iter_func(start, size); - kn->kn_start = end; - kn->kn_task_size -= size; + position = kn->kn_position; + position_offset = kn->kn_task_size - kn->kn_remaining_size; + size = min(kt->kt_chunk_size, kn->kn_remaining_size); + end = kc->kc_iter_func(position, size); + kn->kn_position = end; + kn->kn_remaining_size -= size; WARN_ON(kt->kt_total_size < size); kt->kt_total_size -= size; - if (kn->kn_task_size == 0) { + if (kn->kn_remaining_size == 0) { WARN_ON(kt->kt_nr_nodes_left == 0); kt->kt_nr_nodes_left--; } mutex_unlock(&kt->kt_mutex); - ret = kc->kc_thread_func(start, end, kc->kc_func_arg); + ret = kc->kc_thread_func(position, end, kc->kc_func_arg); mutex_lock(&kt->kt_mutex); - /* Save first error code only. */ - if (kt->kt_error == KTASK_RETURN_SUCCESS && ret != kt->kt_error) - kt->kt_error = ret; + if (ret != KTASK_RETURN_SUCCESS) { + /* Save first error code only. */ + if (kt->kt_error == KTASK_RETURN_SUCCESS) + kt->kt_error = ret; + /* + * If this task has an undo function, save information + * about where this thread failed for ktask_undo. + */ + if (kc->kc_undo_func) { + list_move(&kw->kw_list, &kn->kn_failed_works); + kw->kw_error_start = position; + kw->kw_error_offset = position_offset; + kw->kw_error_end = end; + } + } } WARN_ON(kt->kt_nr_nodes_left > 0 && @@ -335,26 +354,85 @@ static size_t ktask_init_works(struct ktask_node *nodes, size_t nr_nodes, } static void ktask_fini_works(struct ktask_task *kt, + struct ktask_work *stack_work, struct list_head *works_list) { - struct ktask_work *work; + struct ktask_work *work, *next; spin_lock(&ktask_rlim_lock); /* Put the works back on the free list, adjusting rlimits. */ - list_for_each_entry(work, works_list, kw_list) { + list_for_each_entry_safe(work, next, works_list, kw_list) { + if (work == stack_work) { + /* On this thread's stack, so not subject to rlimits. 
*/ + list_del(&work->kw_list); + continue; + } if (work->kw_queue_nid != NUMA_NO_NODE) { WARN_ON(ktask_rlim_node_cur[work->kw_queue_nid] == 0); --ktask_rlim_node_cur[work->kw_queue_nid]; } WARN_ON(ktask_rlim_cur == 0); --ktask_rlim_cur; + list_move(&work->kw_list, &ktask_free_works); } - list_splice(works_list, &ktask_free_works); - spin_unlock(&ktask_rlim_lock); } +static int ktask_error_cmp(void *unused, struct list_head *a, + struct list_head *b) +{ + struct ktask_work *work_a = list_entry(a, struct ktask_work, kw_list); + struct ktask_work *work_b = list_entry(b, struct ktask_work, kw_list); + + if (work_a->kw_error_offset < work_b->kw_error_offset) + return -1; + else if (work_a->kw_error_offset > work_b->kw_error_offset) + return 1; + return 0; +} + +static void ktask_undo(struct ktask_node *nodes, size_t nr_nodes, + struct ktask_ctl *ctl, struct list_head *works_list) +{ + size_t i; + + for (i = 0; i < nr_nodes; ++i) { + struct ktask_node *kn = &nodes[i]; + struct list_head *failed_works = &kn->kn_failed_works; + struct ktask_work *failed_work; + void *undo_pos = kn->kn_start; + void *undo_end; + + /* Sort so the failed ranges can be checked as we go. */ + list_sort(NULL, failed_works, ktask_error_cmp); + + /* Undo completed work on this node, skipping failed ranges. */ + while (undo_pos != kn->kn_position) { + failed_work = list_first_entry_or_null(failed_works, + struct ktask_work, + kw_list); + if (failed_work) + undo_end = failed_work->kw_error_start; + else + undo_end = kn->kn_position; + + if (undo_pos != undo_end) { + ctl->kc_undo_func(undo_pos, undo_end, + ctl->kc_func_arg); + } + + if (failed_work) { + undo_pos = failed_work->kw_error_end; + list_move(&failed_work->kw_list, works_list); + } else { + undo_pos = undo_end; + } + } + WARN_ON(!list_empty(failed_works)); + } +} + int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes, struct ktask_ctl *ctl) { @@ -374,6 +452,9 @@ int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes, for (i = 0; i < nr_nodes; ++i) { kt.kt_total_size += nodes[i].kn_task_size; + nodes[i].kn_position = nodes[i].kn_start; + nodes[i].kn_remaining_size = nodes[i].kn_task_size; + INIT_LIST_HEAD(&nodes[i].kn_failed_works); if (nodes[i].kn_task_size == 0) kt.kt_nr_nodes_left--; @@ -396,12 +477,16 @@ int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes, /* Use the current thread, which saves starting a workqueue worker. */ ktask_init_work(&kw, &kt, 0, nodes[0].kn_nid); + INIT_LIST_HEAD(&kw.kw_list); ktask_thread(&kw.kw_work); /* Wait for all the jobs to finish. 
*/ wait_for_completion(&kt.kt_ktask_done); - ktask_fini_works(&kt, &works_list); + if (kt.kt_error && ctl->kc_undo_func) + ktask_undo(nodes, nr_nodes, ctl, &works_list); + + ktask_fini_works(&kt, &kw, &works_list); mutex_destroy(&kt.kt_mutex); return kt.kt_error; From patchwork Mon Nov 5 16:55:49 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Daniel Jordan X-Patchwork-Id: 10668739 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 74E8614E2 for ; Mon, 5 Nov 2018 16:59:52 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6229229B8D for ; Mon, 5 Nov 2018 16:59:52 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 54C0829BA0; Mon, 5 Nov 2018 16:59:52 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E50DB29B8D for ; Mon, 5 Nov 2018 16:59:51 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387559AbeKFCRW (ORCPT ); Mon, 5 Nov 2018 21:17:22 -0500 Received: from userp2120.oracle.com ([156.151.31.85]:43414 "EHLO userp2120.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729762AbeKFCRU (ORCPT ); Mon, 5 Nov 2018 21:17:20 -0500 Received: from pps.filterd (userp2120.oracle.com [127.0.0.1]) by userp2120.oracle.com (8.16.0.22/8.16.0.22) with SMTP id wA5Gs0k6052184; Mon, 5 Nov 2018 16:56:14 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=corp-2018-07-02; bh=gutsZD7DMWVODOwljz7ZZ5gCU9HyKt2gG0UQQPSwfmk=; b=EkKCZB+TXN9kcQy8IJsFm22o2vVRS07DBy+dXqp9LYMjvEMjD1joAEahNPTZWc9ZSOmz jLoBY+e+kYHrOU9hF9AQU0sD8MM7+F56gU6xaI1jjB5+wQay9w/jLNviOetya7zvPXch GCibVfJyxI31RjE3qiCendiAu8YQSsbDf/1kd+mxdw/7Yk6uo+gvyW3uKGOdZy9bGC+P f6AfG+cN7UIfR/20nHqSA6ODKq6Za82OnJo62+9DObtkPG1xFnY4PoV6J6fxWBtxFwsa K2tZ3GX/fPrk0jTT9cDkMVQmy+sUTCPp9lxtYzpdf/CQNjvcmGHo+CDYsaxjIX/IuFoV /A== Received: from aserv0022.oracle.com (aserv0022.oracle.com [141.146.126.234]) by userp2120.oracle.com with ESMTP id 2nh4aqg2cp-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 05 Nov 2018 16:56:14 +0000 Received: from userv0122.oracle.com (userv0122.oracle.com [156.151.31.75]) by aserv0022.oracle.com (8.14.4/8.14.4) with ESMTP id wA5GuDIT022159 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 5 Nov 2018 16:56:13 GMT Received: from abhmp0006.oracle.com (abhmp0006.oracle.com [141.146.116.12]) by userv0122.oracle.com (8.14.4/8.14.4) with ESMTP id wA5GuCeZ010802; Mon, 5 Nov 2018 16:56:12 GMT Received: from localhost.localdomain (/73.60.114.248) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Mon, 05 Nov 2018 08:56:11 -0800 From: Daniel Jordan To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, 
    bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com,
    dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com,
    jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com,
    Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com,
    rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com,
    tj@kernel.org, vbabka@suse.cz
Subject: [RFC PATCH v4 04/13] ktask: run helper threads at MAX_NICE
Date: Mon, 5 Nov 2018 11:55:49 -0500
Message-Id: <20181105165558.11698-5-daniel.m.jordan@oracle.com>
X-Mailer: git-send-email 2.19.1
In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com>
References: <20181105165558.11698-1-daniel.m.jordan@oracle.com>
MIME-Version: 1.0
Sender: kvm-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: kvm@vger.kernel.org

Multithreading may speed long-running kernel tasks, but overly optimistic
parallelization can go wrong if too many helper threads are started on an
already-busy system.  Such helpers can degrade the performance of other
tasks, so they should be sensitive to current CPU utilization[1].

To achieve this, run helpers at MAX_NICE so that their CPU time is
proportional to idle CPU time.  The main thread that called into ktask
naturally runs at its original priority so that it can make progress on a
heavily loaded system, as it would if ktask were not in the picture.

I tested two different cases in which a non-ktask and a ktask workload
compete for the same CPUs with the goal of showing that normal priority
(i.e. nice=0) ktask helpers cause the non-ktask workload to run more
slowly, whereas MAX_NICE ktask helpers don't.

Testing notes:
 - Each case was run using 8 CPUs on a large two-socket server, with a
   cpumask allowing all test threads to run anywhere within the 8.
 - The non-ktask workload used 7 threads and the ktask workload used 8
   threads to evaluate how much ktask helpers, rather than the main ktask
   thread, disturbed the non-ktask workload.
 - The non-ktask workload was started after the ktask workload and run for
   less time to maximize the chances that the non-ktask workload would be
   disturbed.
 - Runtimes in seconds.

Case 1: Synthetic, worst-case CPU contention

 ktask_test - a tight loop doing integer multiplication to max out on CPU;
              used for testing only, does not appear in this series
 stress-ng  - cpu stressor ("-c --cpu-method ackerman --cpu-ops 1200");

              stress-ng alone (stdev)   max_nice (stdev)   normal_prio (stdev)
 -----------------------------------------------------------------------------
 ktask_test                             96.87 ( 1.09)      90.81 ( 0.29)
 stress-ng    43.04 ( 0.00)             43.58 ( 0.01)      75.86 ( 0.39)

This case shows MAX_NICE helpers make a significant difference compared to
normal priority helpers, with stress-ng taking 76% longer to finish when
competing with normal priority ktask threads than when run by itself, but
only 1% longer when run with MAX_NICE helpers.  The 1% comes from the small
amount of CPU time MAX_NICE threads are given despite their low priority.
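The mechanism itself is small: it amounts to applying unbound-workqueue
attributes whose nice field is MAX_NICE, which is what the diff below does
for ktask's two workqueues.  A minimal, self-contained sketch of that
pattern follows; the "helper_wq" name is hypothetical, not part of this
series.

/*
 * Sketch only: lower an unbound workqueue's workers to MAX_NICE so that
 * queued helpers mostly consume otherwise-idle CPU time.
 */
#include <linux/workqueue.h>
#include <linux/sched/prio.h>	/* MAX_NICE */
#include <linux/slab.h>		/* GFP_KERNEL */

static struct workqueue_struct *helper_wq;

static int __init helper_wq_init(void)
{
	struct workqueue_attrs *attrs;
	int ret;

	helper_wq = alloc_workqueue("helper_wq", WQ_UNBOUND, 0);
	if (!helper_wq)
		return -ENOMEM;

	attrs = alloc_workqueue_attrs(GFP_KERNEL);
	if (!attrs) {
		ret = -ENOMEM;
		goto out_wq;
	}

	/* Helpers yield to nice-0 work; they only soak up idle cycles. */
	attrs->nice = MAX_NICE;
	ret = apply_workqueue_attrs(helper_wq, attrs);
	free_workqueue_attrs(attrs);
	if (ret)
		goto out_wq;

	return 0;

out_wq:
	destroy_workqueue(helper_wq);
	return ret;
}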
Case 2: Real-world CPU contention ktask_vfio - VFIO page pin a 175G kvm guest usemem - faults in 25G of anonymous THP per thread, PAGE_SIZE stride; used to mimic the page clearing that dominates in ktask_vfio so that usemem competes for the same system resources usemem alone (stdev) max_nice (stdev) normal_prio (stdev) ------------------------------------------------------------ ktask_vfio 14.74 ( 0.04) 9.93 ( 0.09) usemem 10.45 ( 0.04) 10.75 ( 0.04) 14.14 ( 0.07) In the more realistic case 2, the effect is similar although not as pronounced. The usemem threads take 35% longer to finish with normal priority ktask threads than when run alone, but only 3% longer when MAX_NICE is used. All ktask users outside of VFIO boil down to page clearing, so I imagine the results would be similar for them. [1] lkml.kernel.org/r/20171206143509.GG7515@dhcp22.suse.cz Signed-off-by: Daniel Jordan --- kernel/ktask.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/kernel/ktask.c b/kernel/ktask.c index b91c62f14dcd..72293a0f50c3 100644 --- a/kernel/ktask.c +++ b/kernel/ktask.c @@ -575,6 +575,18 @@ void __init ktask_init(void) goto alloc_fail; } + /* + * All ktask worker threads have the lowest priority on the system so + * they don't disturb other workloads. + */ + attrs->nice = MAX_NICE; + + ret = apply_workqueue_attrs(ktask_wq, attrs); + if (ret != 0) { + pr_warn("disabled (couldn't apply attrs to ktask_wq)"); + goto apply_fail; + } + attrs->no_numa = true; ret = apply_workqueue_attrs(ktask_nonuma_wq, attrs); From patchwork Mon Nov 5 16:55:50 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Daniel Jordan X-Patchwork-Id: 10668743 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 15A7014BD for ; Mon, 5 Nov 2018 16:59:57 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 0372529B8D for ; Mon, 5 Nov 2018 16:59:57 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id EACFD29BA0; Mon, 5 Nov 2018 16:59:56 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id BD99C29B8D for ; Mon, 5 Nov 2018 16:59:55 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387831AbeKFCU1 (ORCPT ); Mon, 5 Nov 2018 21:20:27 -0500 Received: from userp2120.oracle.com ([156.151.31.85]:43434 "EHLO userp2120.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730068AbeKFCRV (ORCPT ); Mon, 5 Nov 2018 21:17:21 -0500 Received: from pps.filterd (userp2120.oracle.com [127.0.0.1]) by userp2120.oracle.com (8.16.0.22/8.16.0.22) with SMTP id wA5Gs0Tw052185; Mon, 5 Nov 2018 16:56:16 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=corp-2018-07-02; bh=ocotEYS5gwBRk6gU87/nqGF4VbCejB5eUqqLuMjIfMQ=; b=mN54KvD5juUsCho+IUsHz9pdJiCENCu2Zk68E3tt4J+ZeqbK+u8lhaGnR6ede6YIV8Zi 
G0cJxC6Hstx3BlUz8Tv2uPx3bhRvWsGAXx9kbwg8Ly8VAVJlRrsl5c9iryriKxkNZdLB 77woyrxNGrg2hihBo1BHyHVoMpZzRMYwRqNLTV56De9PZFqzHVyjhuejpkJXBbSHW6Pe Ctie/HE8JUJ4+mwTFYPCAPSXCpILnHber5mldAbJ5QcWes73oNe55l7E7j/3l8EDceln qT51C5PTnfliZqYGCd7Ow1YgJLGX+/sSMn9Xw/XrlqskUtf5cqAZvluvCa0F7Y8TmInw ow== Received: from aserv0022.oracle.com (aserv0022.oracle.com [141.146.126.234]) by userp2120.oracle.com with ESMTP id 2nh4aqg2cx-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 05 Nov 2018 16:56:16 +0000 Received: from userv0122.oracle.com (userv0122.oracle.com [156.151.31.75]) by aserv0022.oracle.com (8.14.4/8.14.4) with ESMTP id wA5GuFfJ022245 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 5 Nov 2018 16:56:15 GMT Received: from abhmp0006.oracle.com (abhmp0006.oracle.com [141.146.116.12]) by userv0122.oracle.com (8.14.4/8.14.4) with ESMTP id wA5GuEud010809; Mon, 5 Nov 2018 16:56:14 GMT Received: from localhost.localdomain (/73.60.114.248) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Mon, 05 Nov 2018 08:56:13 -0800 From: Daniel Jordan To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz Subject: [RFC PATCH v4 05/13] workqueue, ktask: renice helper threads to prevent starvation Date: Mon, 5 Nov 2018 11:55:50 -0500 Message-Id: <20181105165558.11698-6-daniel.m.jordan@oracle.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com> References: <20181105165558.11698-1-daniel.m.jordan@oracle.com> MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9068 signatures=668683 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=2 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1807170000 definitions=main-1811050153 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP With ktask helper threads running at MAX_NICE, it's possible for one or more of them to begin chunks of the task and then have their CPU time constrained by higher priority threads. The main ktask thread, running at normal priority, may finish all available chunks of the task and then wait on the MAX_NICE helpers to finish the last in-progress chunks, for longer than it would have if no helpers were used. Avoid this by having the main thread assign its priority to each unfinished helper one at a time so that on a heavily loaded system, exactly one thread in a given ktask call is running at the main thread's priority. At least one thread to ensure forward progress, and at most one thread to limit excessive multithreading. Since the workqueue interface, on which ktask is built, does not provide access to worker threads, ktask can't adjust their priorities directly, so add a new interface to allow a previously-queued work item to run at a different priority than the one controlled by the corresponding workqueue's 'nice' attribute. 
The worker assigned to the work item will run the work at the given
priority, temporarily overriding the worker's priority.

The interface is flush_work_at_nice, which ensures the given work item's
assigned worker runs at the specified nice level and waits for the work
item to finish.

An alternative choice would have been to simply requeue the work item to a
pool with workers of the new priority, but this doesn't seem feasible
because a worker may have already started executing the work and there's
currently no way to interrupt it midway through.  The proposed interface
solves this issue because a worker's priority can be adjusted while it's
executing the work.

TODO: flush_work_at_nice is a proof-of-concept only, and it may be desired
to have the interface set the work's nice without also waiting for it to
finish.  It's implemented in the flush path for this RFC because it was
fairly simple to write ;-)

I ran tests similar to the ones in the last patch with a couple of
differences:
 - The non-ktask workload uses 8 CPUs instead of 7 to compete with the
   main ktask thread as well as the ktask helpers, so that when the main
   thread finishes, its CPU is completely occupied by the non-ktask
   workload, meaning MAX_NICE helpers can't run as often.
 - The non-ktask workload starts before the ktask workload, rather than
   after, to maximize the chance that it starves helpers.

Runtimes in seconds.

Case 1: Synthetic, worst-case CPU contention

 ktask_test - a tight loop doing integer multiplication to max out on CPU;
              used for testing only, does not appear in this series
 stress-ng  - cpu stressor ("-c --cpu-method ackerman --cpu-ops 1200");

              8_ktask_thrs          8_ktask_thrs
              w/o_renice  (stdev)   with_renice  (stdev)   1_ktask_thr  (stdev)
 ------------------------------------------------------------------------------
 ktask_test   41.98  ( 0.22)        25.15  ( 2.98)         30.40  ( 0.61)
 stress-ng    44.79  ( 1.11)        46.37  ( 0.69)         53.29  ( 1.91)

Without renicing, ktask_test finishes just after stress-ng does because
stress-ng needs to free up CPUs for the helpers to finish (ktask_test shows
a shorter runtime than stress-ng because ktask_test was started later).
Renicing lets ktask_test finish 40% sooner, and running the same amount of
work in ktask_test with 1 thread instead of 8 finishes in a comparable
amount of time, though longer than "with_renice" because MAX_NICE threads
still get some CPU time, and the effect over 8 threads adds up.

stress-ng's total runtime gets a little longer going from no renicing to
renicing, as expected, because each reniced ktask thread takes more CPU
time than before when the helpers were starved.

Running with one ktask thread, stress-ng's reported walltime goes up
because that single thread interferes with fewer stress-ng threads, but
with more impact, causing a greater spread in the time it takes for
individual stress-ng threads to finish.  Averages of the per-thread
stress-ng times from "with_renice" to "1_ktask_thr" come out roughly the
same, though, 43.81 and 43.89 respectively.  So the total runtime of
stress-ng across all threads is unaffected, but the time stress-ng takes to
finish running its threads completely actually improves by spreading the
ktask_test work over more threads.
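From the caller's side, the renicing is a simple loop: flush each
still-unfinished helper at the caller's own nice level, so that exactly one
helper at a time inherits that priority.  A sketch of the pattern follows;
the structure and list names are illustrative, and ktask_wait_for_completion()
in the diff below is the real user of the proposed interface.

/*
 * Sketch only: a caller that queued several helper work items wills its
 * own nice level to each unfinished helper, one at a time, using the
 * proposed flush_work_at_nice().
 */
#include <linux/workqueue.h>
#include <linux/sched.h>	/* task_nice() */
#include <linux/list.h>

struct helper_work {
	struct work_struct work;
	struct list_head list;		/* on the caller's unfinished list */
};

static void wait_for_helpers(struct list_head *unfinished)
{
	struct helper_work *hw;

	list_for_each_entry(hw, unfinished, list) {
		/*
		 * At most one helper at a time runs at the caller's nice
		 * level; the rest stay at the workqueue's MAX_NICE setting.
		 * flush_work_at_nice() returns once this helper is done.
		 */
		flush_work_at_nice(&hw->work, task_nice(current));
	}
}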
Case 2: Real-world CPU contention ktask_vfio - VFIO page pin a 32G kvm guest usemem - faults in 86G of anonymous THP per thread, PAGE_SIZE stride; used to mimic the page clearing that dominates in ktask_vfio so that usemem competes for the same system resources 8_ktask_thrs 8_ktask_thrs w/o_renice (stdev) with_renice (stdev) 1_ktask_thr (stdev) ------------------------------------------------------------------ ktask_vfio 18.59 ( 0.19) 14.62 ( 2.03) 16.24 ( 0.90) usemem 47.54 ( 0.89) 48.18 ( 0.77) 49.70 ( 1.20) These results are similar to case 1's, though the differences between times are not quite as pronounced because ktask_vfio ran shorter compared to usemem. Signed-off-by: Daniel Jordan --- include/linux/workqueue.h | 5 ++ kernel/ktask.c | 81 ++++++++++++++++++----------- kernel/workqueue.c | 106 +++++++++++++++++++++++++++++++++++--- 3 files changed, 156 insertions(+), 36 deletions(-) diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index 60d673e15632..d2976547c9c3 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -95,6 +95,10 @@ enum { WORK_BUSY_PENDING = 1 << 0, WORK_BUSY_RUNNING = 1 << 1, + /* flags for flush_work and similar functions */ + WORK_FLUSH_FROM_CANCEL = 1 << 0, + WORK_FLUSH_AT_NICE = 1 << 1, + /* maximum string length for set_worker_desc() */ WORKER_DESC_LEN = 24, }; @@ -477,6 +481,7 @@ extern int schedule_on_each_cpu(work_func_t func); int execute_in_process_context(work_func_t fn, struct execute_work *); extern bool flush_work(struct work_struct *work); +extern bool flush_work_at_nice(struct work_struct *work, long nice); extern bool cancel_work_sync(struct work_struct *work); extern bool flush_delayed_work(struct delayed_work *dwork); diff --git a/kernel/ktask.c b/kernel/ktask.c index 72293a0f50c3..9d2727ce430c 100644 --- a/kernel/ktask.c +++ b/kernel/ktask.c @@ -16,7 +16,6 @@ #include #include -#include #include #include #include @@ -24,6 +23,7 @@ #include #include #include +#include #include #include #include @@ -41,6 +41,11 @@ static size_t *ktask_rlim_node_max; #define KTASK_CPUFRAC_NUMER 4 #define KTASK_CPUFRAC_DENOM 5 +enum ktask_work_flags { + KTASK_WORK_FINISHED = 1, + KTASK_WORK_UNDO = 2, +}; + /* Used to pass ktask data to the workqueue API. */ struct ktask_work { struct work_struct kw_work; @@ -53,6 +58,7 @@ struct ktask_work { void *kw_error_end; /* ktask_free_works, kn_failed_works linkage */ struct list_head kw_list; + enum ktask_work_flags kw_flags; }; static LIST_HEAD(ktask_free_works); @@ -68,10 +74,7 @@ struct ktask_task { struct ktask_node *kt_nodes; size_t kt_nr_nodes; size_t kt_nr_nodes_left; - size_t kt_nworks; - size_t kt_nworks_fini; int kt_error; /* first error from thread_func */ - struct completion kt_ktask_done; }; /* @@ -97,6 +100,7 @@ static void ktask_init_work(struct ktask_work *kw, struct ktask_task *kt, kw->kw_task = kt; kw->kw_ktask_node_i = ktask_node_i; kw->kw_queue_nid = queue_nid; + kw->kw_flags = 0; } static void ktask_queue_work(struct ktask_work *kw) @@ -171,7 +175,6 @@ static void ktask_thread(struct work_struct *work) struct ktask_task *kt = kw->kw_task; struct ktask_ctl *kc = &kt->kt_ctl; struct ktask_node *kn = &kt->kt_nodes[kw->kw_ktask_node_i]; - bool done; mutex_lock(&kt->kt_mutex); @@ -239,6 +242,7 @@ static void ktask_thread(struct work_struct *work) * about where this thread failed for ktask_undo. 
*/ if (kc->kc_undo_func) { + kw->kw_flags |= KTASK_WORK_UNDO; list_move(&kw->kw_list, &kn->kn_failed_works); kw->kw_error_start = position; kw->kw_error_offset = position_offset; @@ -250,13 +254,8 @@ static void ktask_thread(struct work_struct *work) WARN_ON(kt->kt_nr_nodes_left > 0 && kt->kt_error == KTASK_RETURN_SUCCESS); - ++kt->kt_nworks_fini; - WARN_ON(kt->kt_nworks_fini > kt->kt_nworks); - done = (kt->kt_nworks_fini == kt->kt_nworks); + kw->kw_flags |= KTASK_WORK_FINISHED; mutex_unlock(&kt->kt_mutex); - - if (done) - complete(&kt->kt_ktask_done); } /* @@ -294,7 +293,7 @@ static size_t ktask_chunk_size(size_t task_size, size_t min_chunk_size, */ static size_t ktask_init_works(struct ktask_node *nodes, size_t nr_nodes, struct ktask_task *kt, - struct list_head *works_list) + struct list_head *unfinished_works) { size_t i, nr_works, nr_works_check; size_t min_chunk_size = kt->kt_ctl.kc_min_chunk_size; @@ -342,7 +341,7 @@ static size_t ktask_init_works(struct ktask_node *nodes, size_t nr_nodes, WARN_ON(list_empty(&ktask_free_works)); kw = list_first_entry(&ktask_free_works, struct ktask_work, kw_list); - list_move_tail(&kw->kw_list, works_list); + list_move_tail(&kw->kw_list, unfinished_works); ktask_init_work(kw, kt, ktask_node_i, queue_nid); ++ktask_rlim_cur; @@ -355,14 +354,14 @@ static size_t ktask_init_works(struct ktask_node *nodes, size_t nr_nodes, static void ktask_fini_works(struct ktask_task *kt, struct ktask_work *stack_work, - struct list_head *works_list) + struct list_head *finished_works) { struct ktask_work *work, *next; spin_lock(&ktask_rlim_lock); /* Put the works back on the free list, adjusting rlimits. */ - list_for_each_entry_safe(work, next, works_list, kw_list) { + list_for_each_entry_safe(work, next, finished_works, kw_list) { if (work == stack_work) { /* On this thread's stack, so not subject to rlimits. */ list_del(&work->kw_list); @@ -393,7 +392,7 @@ static int ktask_error_cmp(void *unused, struct list_head *a, } static void ktask_undo(struct ktask_node *nodes, size_t nr_nodes, - struct ktask_ctl *ctl, struct list_head *works_list) + struct ktask_ctl *ctl, struct list_head *finished_works) { size_t i; @@ -424,7 +423,8 @@ static void ktask_undo(struct ktask_node *nodes, size_t nr_nodes, if (failed_work) { undo_pos = failed_work->kw_error_end; - list_move(&failed_work->kw_list, works_list); + list_move(&failed_work->kw_list, + finished_works); } else { undo_pos = undo_end; } @@ -433,20 +433,46 @@ static void ktask_undo(struct ktask_node *nodes, size_t nr_nodes, } } +static void ktask_wait_for_completion(struct ktask_task *kt, + struct list_head *unfinished_works, + struct list_head *finished_works) +{ + struct ktask_work *work; + + mutex_lock(&kt->kt_mutex); + while (!list_empty(unfinished_works)) { + work = list_first_entry(unfinished_works, struct ktask_work, + kw_list); + if (!(work->kw_flags & KTASK_WORK_FINISHED)) { + mutex_unlock(&kt->kt_mutex); + flush_work_at_nice(&work->kw_work, task_nice(current)); + mutex_lock(&kt->kt_mutex); + WARN_ON_ONCE(!(work->kw_flags & KTASK_WORK_FINISHED)); + } + /* + * Leave works used in ktask_undo on kn->kn_failed_works. + * ktask_undo will move them to finished_works. 
+ */ + if (!(work->kw_flags & KTASK_WORK_UNDO)) + list_move(&work->kw_list, finished_works); + } + mutex_unlock(&kt->kt_mutex); +} + int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes, struct ktask_ctl *ctl) { - size_t i; + size_t i, nr_works; struct ktask_work kw; struct ktask_work *work; - LIST_HEAD(works_list); + LIST_HEAD(unfinished_works); + LIST_HEAD(finished_works); struct ktask_task kt = { .kt_ctl = *ctl, .kt_total_size = 0, .kt_nodes = nodes, .kt_nr_nodes = nr_nodes, .kt_nr_nodes_left = nr_nodes, - .kt_nworks_fini = 0, .kt_error = KTASK_RETURN_SUCCESS, }; @@ -465,14 +491,12 @@ int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes, return KTASK_RETURN_SUCCESS; mutex_init(&kt.kt_mutex); - init_completion(&kt.kt_ktask_done); - kt.kt_nworks = ktask_init_works(nodes, nr_nodes, &kt, &works_list); + nr_works = ktask_init_works(nodes, nr_nodes, &kt, &unfinished_works); kt.kt_chunk_size = ktask_chunk_size(kt.kt_total_size, - ctl->kc_min_chunk_size, - kt.kt_nworks); + ctl->kc_min_chunk_size, nr_works); - list_for_each_entry(work, &works_list, kw_list) + list_for_each_entry(work, &unfinished_works, kw_list) ktask_queue_work(work); /* Use the current thread, which saves starting a workqueue worker. */ @@ -480,13 +504,12 @@ int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes, INIT_LIST_HEAD(&kw.kw_list); ktask_thread(&kw.kw_work); - /* Wait for all the jobs to finish. */ - wait_for_completion(&kt.kt_ktask_done); + ktask_wait_for_completion(&kt, &unfinished_works, &finished_works); if (kt.kt_error && ctl->kc_undo_func) - ktask_undo(nodes, nr_nodes, ctl, &works_list); + ktask_undo(nodes, nr_nodes, ctl, &finished_works); - ktask_fini_works(&kt, &kw, &works_list); + ktask_fini_works(&kt, &kw, &finished_works); mutex_destroy(&kt.kt_mutex); return kt.kt_error; diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 0280deac392e..9fbae3fc9cca 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -79,6 +79,7 @@ enum { WORKER_CPU_INTENSIVE = 1 << 6, /* cpu intensive */ WORKER_UNBOUND = 1 << 7, /* worker is unbound */ WORKER_REBOUND = 1 << 8, /* worker was rebound */ + WORKER_NICED = 1 << 9, /* worker's nice was adjusted */ WORKER_NOT_RUNNING = WORKER_PREP | WORKER_CPU_INTENSIVE | WORKER_UNBOUND | WORKER_REBOUND, @@ -2184,6 +2185,18 @@ __acquires(&pool->lock) if (unlikely(cpu_intensive)) worker_clr_flags(worker, WORKER_CPU_INTENSIVE); + /* + * worker's nice level was adjusted (see flush_work_at_nice). Use the + * work's color to distinguish between the work that sets the nice + * level (== NO_COLOR) and the work for which the adjustment was made + * (!= NO_COLOR) to avoid prematurely restoring the nice level. 
+ */ + if (unlikely(worker->flags & WORKER_NICED && + work_color != WORK_NO_COLOR)) { + set_user_nice(worker->task, worker->pool->attrs->nice); + worker_clr_flags(worker, WORKER_NICED); + } + /* we're done with it, release */ hash_del(&worker->hentry); worker->current_work = NULL; @@ -2846,8 +2859,53 @@ void drain_workqueue(struct workqueue_struct *wq) } EXPORT_SYMBOL_GPL(drain_workqueue); +struct nice_work { + struct work_struct work; + long nice; +}; + +static void nice_work_func(struct work_struct *work) +{ + struct nice_work *nw = container_of(work, struct nice_work, work); + struct worker *worker = current_wq_worker(); + + if (WARN_ON_ONCE(!worker)) + return; + + set_user_nice(current, nw->nice); + worker->flags |= WORKER_NICED; +} + +/** + * insert_nice_work - insert a nice_work into a pwq + * @pwq: pwq to insert nice_work into + * @nice_work: nice_work to insert + * @target: target work to attach @nice_work to + * + * @nice_work is linked to @target such that @target starts executing only + * after @nice_work finishes execution. + * + * @nice_work's only job is to ensure @target's assigned worker runs at the + * nice level contained in @nice_work. + * + * CONTEXT: + * spin_lock_irq(pool->lock). + */ +static void insert_nice_work(struct pool_workqueue *pwq, + struct nice_work *nice_work, + struct work_struct *target) +{ + /* see comment above similar code in insert_wq_barrier */ + INIT_WORK_ONSTACK(&nice_work->work, nice_work_func); + __set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&nice_work->work)); + + debug_work_activate(&nice_work->work); + insert_work(pwq, &nice_work->work, &target->entry, + work_color_to_flags(WORK_NO_COLOR) | WORK_STRUCT_LINKED); +} + static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr, - bool from_cancel) + struct nice_work *nice_work, int flags) { struct worker *worker = NULL; struct worker_pool *pool; @@ -2868,11 +2926,19 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr, if (pwq) { if (unlikely(pwq->pool != pool)) goto already_gone; + + /* not yet started, insert linked work before work */ + if (unlikely(flags & WORK_FLUSH_AT_NICE)) + insert_nice_work(pwq, nice_work, work); } else { worker = find_worker_executing_work(pool, work); if (!worker) goto already_gone; pwq = worker->current_pwq; + if (unlikely(flags & WORK_FLUSH_AT_NICE)) { + set_user_nice(worker->task, nice_work->nice); + worker->flags |= WORKER_NICED; + } } check_flush_dependency(pwq->wq, work); @@ -2889,7 +2955,7 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr, * workqueues the deadlock happens when the rescuer stalls, blocking * forward progress. 
*/ - if (!from_cancel && + if (!(flags & WORK_FLUSH_FROM_CANCEL) && (pwq->wq->saved_max_active == 1 || pwq->wq->rescuer)) { lock_map_acquire(&pwq->wq->lockdep_map); lock_map_release(&pwq->wq->lockdep_map); @@ -2901,19 +2967,23 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr, return false; } -static bool __flush_work(struct work_struct *work, bool from_cancel) +static bool __flush_work(struct work_struct *work, int flags, long nice) { struct wq_barrier barr; + struct nice_work nice_work; if (WARN_ON(!wq_online)) return false; - if (!from_cancel) { + if (!(flags & WORK_FLUSH_FROM_CANCEL)) { lock_map_acquire(&work->lockdep_map); lock_map_release(&work->lockdep_map); } - if (start_flush_work(work, &barr, from_cancel)) { + if (unlikely(flags & WORK_FLUSH_AT_NICE)) + nice_work.nice = nice; + + if (start_flush_work(work, &barr, &nice_work, flags)) { wait_for_completion(&barr.done); destroy_work_on_stack(&barr.work); return true; @@ -2935,10 +3005,32 @@ static bool __flush_work(struct work_struct *work, bool from_cancel) */ bool flush_work(struct work_struct *work) { - return __flush_work(work, false); + return __flush_work(work, 0, 0); } EXPORT_SYMBOL_GPL(flush_work); +/** + * flush_work_at_nice - set a work's nice level and wait for it to finish + * @work: the target work + * @nice: nice level @work's assigned worker should run at + * + * Makes @work's assigned worker run at @nice for the duration of @work. + * Waits until @work has finished execution. @work is guaranteed to be idle + * on return if it hasn't been requeued since flush started. + * + * Avoids priority inversion where a high priority task queues @work on a + * workqueue with low priority workers and may wait indefinitely for @work's + * completion. That task can will its priority to @work. + * + * Return: + * %true if flush_work_at_nice() waited for the work to finish execution, + * %false if it was already idle. + */ +bool flush_work_at_nice(struct work_struct *work, long nice) +{ + return __flush_work(work, WORK_FLUSH_AT_NICE, nice); +} + struct cwt_wait { wait_queue_entry_t wait; struct work_struct *work; @@ -3001,7 +3093,7 @@ static bool __cancel_work_timer(struct work_struct *work, bool is_dwork) * isn't executing. 
*/ if (wq_online) - __flush_work(work, true); + __flush_work(work, WORK_FLUSH_FROM_CANCEL, 0); clear_work_data(work); From patchwork Mon Nov 5 16:55:51 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Daniel Jordan X-Patchwork-Id: 10668731 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id EDE3C14E2 for ; Mon, 5 Nov 2018 16:59:25 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id DC3C229B8D for ; Mon, 5 Nov 2018 16:59:25 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id CFE9729BA0; Mon, 5 Nov 2018 16:59:25 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id F303129B8D for ; Mon, 5 Nov 2018 16:59:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387675AbeKFCRZ (ORCPT ); Mon, 5 Nov 2018 21:17:25 -0500 Received: from userp2130.oracle.com ([156.151.31.86]:38752 "EHLO userp2130.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2387603AbeKFCRX (ORCPT ); Mon, 5 Nov 2018 21:17:23 -0500 Received: from pps.filterd (userp2130.oracle.com [127.0.0.1]) by userp2130.oracle.com (8.16.0.22/8.16.0.22) with SMTP id wA5GsADv097225; Mon, 5 Nov 2018 16:56:23 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=corp-2018-07-02; bh=W/3M3WgNkyPY7/D1fTX9bMKrquODyVvPod8Snk5MTFI=; b=qleIjIaYJDexWrd2rkhbiQVJrDE7fdLiJs5O9JaxGpJ1v+bzuJu5C6YgyaHn1KnTaX35 bwg0Zw+scNeWJC2Q2LA8tLzwOYwoWPjS7+Z36OEWqdsr0xd6BJ5CpqLNuusLaorsTmFU 1oInGkb/CSCy/rqjH/GRsKSCnstAQlQiMHWaPbGas0m77RdN6TS7lePa8Sn/bYowqmuk 6HqP7REWoucTlRd1X0rd7eH2n5cnxFPy6vin82CMBSiLhbOZVz2LnywpTFMa6C0Rr0Gi G1579hYh9MYMwuYWFfV8m5h2mrK0i8oiis619swGfHr7SspVZa6VbpkDV8V49Cv0dljf mQ== Received: from aserv0022.oracle.com (aserv0022.oracle.com [141.146.126.234]) by userp2130.oracle.com with ESMTP id 2nh33tr6be-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 05 Nov 2018 16:56:23 +0000 Received: from userv0121.oracle.com (userv0121.oracle.com [156.151.31.72]) by aserv0022.oracle.com (8.14.4/8.14.4) with ESMTP id wA5GuHRq022308 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 5 Nov 2018 16:56:17 GMT Received: from abhmp0006.oracle.com (abhmp0006.oracle.com [141.146.116.12]) by userv0121.oracle.com (8.14.4/8.13.8) with ESMTP id wA5GuGGF008403; Mon, 5 Nov 2018 16:56:16 GMT Received: from localhost.localdomain (/73.60.114.248) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Mon, 05 Nov 2018 08:56:15 -0800 From: Daniel Jordan To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, 
jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz Subject: [RFC PATCH v4 06/13] vfio: parallelize vfio_pin_map_dma Date: Mon, 5 Nov 2018 11:55:51 -0500 Message-Id: <20181105165558.11698-7-daniel.m.jordan@oracle.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com> References: <20181105165558.11698-1-daniel.m.jordan@oracle.com> MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9068 signatures=668683 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=0 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1807170000 definitions=main-1811050153 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP When starting a large-memory kvm guest, it takes an excessively long time to start the boot process because qemu must pin all guest pages to accommodate DMA when VFIO is in use. Currently just one CPU is responsible for the page pinning, which usually boils down to page clearing time-wise, so the ways to optimize this are buying a faster CPU ;-) or using more of the CPUs you already have. Parallelize with ktask. Refactor so workqueue workers pin with the mm of the calling thread, and to enable an undo callback for ktask to handle errors during page pinning. Performance results appear later in the series. Signed-off-by: Daniel Jordan --- drivers/vfio/vfio_iommu_type1.c | 106 +++++++++++++++++++++++--------- 1 file changed, 76 insertions(+), 30 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index d9fd3188615d..e7cfbf0c8071 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -41,6 +41,7 @@ #include #include #include +#include #define DRIVER_VERSION "0.2" #define DRIVER_AUTHOR "Alex Williamson " @@ -395,7 +396,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, */ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, long npage, unsigned long *pfn_base, - unsigned long limit) + unsigned long limit, struct mm_struct *mm) { unsigned long pfn = 0; long ret, pinned = 0, lock_acct = 0; @@ -403,10 +404,10 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, dma_addr_t iova = vaddr - dma->vaddr + dma->iova; /* This code path is only user initiated */ - if (!current->mm) + if (!mm) return -ENODEV; - ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, pfn_base); + ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base); if (ret) return ret; @@ -418,7 +419,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, * pages are already counted against the user. 
*/ if (!rsvd && !vfio_find_vpfn(dma, iova)) { - if (!dma->lock_cap && current->mm->locked_vm + 1 > limit) { + if (!dma->lock_cap && mm->locked_vm + 1 > limit) { put_pfn(*pfn_base, dma->prot); pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__, limit << PAGE_SHIFT); @@ -433,7 +434,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, /* Lock all the consecutive pages from pfn_base */ for (vaddr += PAGE_SIZE, iova += PAGE_SIZE; pinned < npage; pinned++, vaddr += PAGE_SIZE, iova += PAGE_SIZE) { - ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, &pfn); + ret = vaddr_get_pfn(mm, vaddr, dma->prot, &pfn); if (ret) break; @@ -445,7 +446,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, if (!rsvd && !vfio_find_vpfn(dma, iova)) { if (!dma->lock_cap && - current->mm->locked_vm + lock_acct + 1 > limit) { + mm->locked_vm + lock_acct + 1 > limit) { put_pfn(pfn, dma->prot); pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__, limit << PAGE_SHIFT); @@ -752,15 +753,15 @@ static size_t unmap_unpin_slow(struct vfio_domain *domain, } static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma, + dma_addr_t iova, dma_addr_t end, bool do_accounting) { - dma_addr_t iova = dma->iova, end = dma->iova + dma->size; struct vfio_domain *domain, *d; LIST_HEAD(unmapped_region_list); int unmapped_region_cnt = 0; long unlocked = 0; - if (!dma->size) + if (iova == end) return 0; if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) @@ -777,7 +778,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma, struct vfio_domain, next); list_for_each_entry_continue(d, &iommu->domain_list, next) { - iommu_unmap(d->domain, dma->iova, dma->size); + iommu_unmap(d->domain, iova, end - iova); cond_resched(); } @@ -818,8 +819,6 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma, } } - dma->iommu_mapped = false; - if (unmapped_region_cnt) unlocked += vfio_sync_unpin(dma, domain, &unmapped_region_list); @@ -830,14 +829,21 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma, return unlocked; } -static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma) +static void vfio_remove_dma_finish(struct vfio_iommu *iommu, + struct vfio_dma *dma) { - vfio_unmap_unpin(iommu, dma, true); + dma->iommu_mapped = false; vfio_unlink_dma(iommu, dma); put_task_struct(dma->task); kfree(dma); } +static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma) +{ + vfio_unmap_unpin(iommu, dma, dma->iova, dma->iova + dma->size, true); + vfio_remove_dma_finish(iommu, dma); +} + static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu) { struct vfio_domain *domain; @@ -1031,20 +1037,29 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova, return ret; } -static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma, - size_t map_size) +struct vfio_pin_args { + struct vfio_iommu *iommu; + struct vfio_dma *dma; + unsigned long limit; + struct mm_struct *mm; +}; + +static int vfio_pin_map_dma_chunk(unsigned long start_vaddr, + unsigned long end_vaddr, + struct vfio_pin_args *args) { - dma_addr_t iova = dma->iova; - unsigned long vaddr = dma->vaddr; - size_t size = map_size; + struct vfio_dma *dma = args->dma; + dma_addr_t iova = dma->iova + (start_vaddr - dma->vaddr); + unsigned long unmapped_size = end_vaddr - start_vaddr; + unsigned long pfn, mapped_size = 0; long npage; - unsigned long pfn, limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; int ret 
= 0; - while (size) { + while (unmapped_size) { /* Pin a contiguous chunk of memory */ - npage = vfio_pin_pages_remote(dma, vaddr + dma->size, - size >> PAGE_SHIFT, &pfn, limit); + npage = vfio_pin_pages_remote(dma, start_vaddr + mapped_size, + unmapped_size >> PAGE_SHIFT, + &pfn, args->limit, args->mm); if (npage <= 0) { WARN_ON(!npage); ret = (int)npage; @@ -1052,22 +1067,50 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma, } /* Map it! */ - ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, - dma->prot); + ret = vfio_iommu_map(args->iommu, iova + mapped_size, pfn, + npage, dma->prot); if (ret) { - vfio_unpin_pages_remote(dma, iova + dma->size, pfn, + vfio_unpin_pages_remote(dma, iova + mapped_size, pfn, npage, true); break; } - size -= npage << PAGE_SHIFT; - dma->size += npage << PAGE_SHIFT; + unmapped_size -= npage << PAGE_SHIFT; + mapped_size += npage << PAGE_SHIFT; } + return (ret == 0) ? KTASK_RETURN_SUCCESS : ret; +} + +static void vfio_pin_map_dma_undo(unsigned long start_vaddr, + unsigned long end_vaddr, + struct vfio_pin_args *args) +{ + struct vfio_dma *dma = args->dma; + dma_addr_t iova = dma->iova + (start_vaddr - dma->vaddr); + dma_addr_t end = dma->iova + (end_vaddr - dma->vaddr); + + vfio_unmap_unpin(args->iommu, args->dma, iova, end, true); +} + +static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma, + size_t map_size) +{ + unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; + int ret = 0; + struct vfio_pin_args args = { iommu, dma, limit, current->mm }; + /* Stay on PMD boundary in case THP is being used. */ + DEFINE_KTASK_CTL(ctl, vfio_pin_map_dma_chunk, &args, PMD_SIZE); + + ktask_ctl_set_undo_func(&ctl, vfio_pin_map_dma_undo); + ret = ktask_run((void *)dma->vaddr, map_size, &ctl); + dma->iommu_mapped = true; if (ret) - vfio_remove_dma(iommu, dma); + vfio_remove_dma_finish(iommu, dma); + else + dma->size += map_size; return ret; } @@ -1229,7 +1272,8 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, npage = vfio_pin_pages_remote(dma, vaddr, n >> PAGE_SHIFT, - &pfn, limit); + &pfn, limit, + current->mm); if (npage <= 0) { WARN_ON(!npage); ret = (int)npage; @@ -1497,7 +1541,9 @@ static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu) long locked = 0, unlocked = 0; dma = rb_entry(n, struct vfio_dma, node); - unlocked += vfio_unmap_unpin(iommu, dma, false); + unlocked += vfio_unmap_unpin(iommu, dma, dma->iova, + dma->iova + dma->size, false); + dma->iommu_mapped = false; p = rb_first(&dma->pfn_list); for (; p; p = rb_next(p)) { struct vfio_pfn *vpfn = rb_entry(p, struct vfio_pfn, From patchwork Mon Nov 5 16:55:52 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Daniel Jordan X-Patchwork-Id: 10668735 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6346114E2 for ; Mon, 5 Nov 2018 16:59:40 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 50D7C29B8D for ; Mon, 5 Nov 2018 16:59:40 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 4457529BA0; Mon, 5 Nov 2018 16:59:40 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, 
DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 3DAA029B8D for ; Mon, 5 Nov 2018 16:59:39 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387657AbeKFCRZ (ORCPT ); Mon, 5 Nov 2018 21:17:25 -0500 Received: from userp2130.oracle.com ([156.151.31.86]:38754 "EHLO userp2130.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2387598AbeKFCRY (ORCPT ); Mon, 5 Nov 2018 21:17:24 -0500 Received: from pps.filterd (userp2130.oracle.com [127.0.0.1]) by userp2130.oracle.com (8.16.0.22/8.16.0.22) with SMTP id wA5Gs1eX097148; Mon, 5 Nov 2018 16:56:19 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=corp-2018-07-02; bh=kTwyB7gUqH7ZPYa5YenpKewBt6cK4ioVbtrtT8tR8jw=; b=2qCbR5C3zZAqth0hgukqdlybcV4uHhL5fs1YXUBigjThoXJj6vgpqR/0m6GGO1TvcjUO GKEA2ixpVWQNoigKTcwSGoaVMvCtkkXYNzBXrlIfd8itxzOhfXQLkXswQXI6/2VnsuWv Jz7vdnLJMcMqu7h2zCsaiSNo7C+J5Ihn0Ttn3aIPNsvQBr5GebXjUrjemEbEgbxHgI6M qdex2jnmyZCoZ2ZY+WLq0y/n2abwOCE76N83Ww4e6k0uyMumWH8taJitb/NgdpmYWGjI ZKwU6nWWsP0KzteRuU69AVpdGlTPqYWIYMsFD/7qvy0a+50JdFlFpQSE3uK/W+0ob06L CA== Received: from userv0021.oracle.com (userv0021.oracle.com [156.151.31.71]) by userp2130.oracle.com with ESMTP id 2nh33tr6ay-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 05 Nov 2018 16:56:18 +0000 Received: from aserv0122.oracle.com (aserv0122.oracle.com [141.146.126.236]) by userv0021.oracle.com (8.14.4/8.14.4) with ESMTP id wA5GuILC004739 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 5 Nov 2018 16:56:18 GMT Received: from abhmp0006.oracle.com (abhmp0006.oracle.com [141.146.116.12]) by aserv0122.oracle.com (8.14.4/8.14.4) with ESMTP id wA5GuHKW017779; Mon, 5 Nov 2018 16:56:17 GMT Received: from localhost.localdomain (/73.60.114.248) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Mon, 05 Nov 2018 08:56:17 -0800 From: Daniel Jordan To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz Subject: [RFC PATCH v4 07/13] mm: change locked_vm's type from unsigned long to atomic_long_t Date: Mon, 5 Nov 2018 11:55:52 -0500 Message-Id: <20181105165558.11698-8-daniel.m.jordan@oracle.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com> References: <20181105165558.11698-1-daniel.m.jordan@oracle.com> MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9068 signatures=668683 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=0 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1807170000 definitions=main-1811050153 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: 
X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Currently, mmap_sem must be held as writer to modify the locked_vm field in mm_struct. This creates a bottleneck when multithreading VFIO page pinning because each thread holds the mmap_sem as reader for the majority of the pinning time but also takes mmap_sem as writer regularly, for short times, when modifying locked_vm. The problem gets worse when other workloads compete for CPU with ktask threads doing page pinning because the other workloads force ktask threads that hold mmap_sem as writer off the CPU, blocking ktask threads trying to get mmap_sem as reader for an excessively long time (the mmap_sem reader wait time grows linearly with the thread count). Requiring mmap_sem for locked_vm also abuses mmap_sem by making it protect data that could be synchronized separately. So, decouple locked_vm from mmap_sem by making locked_vm an atomic_long_t. locked_vm's old type was unsigned long and changing it to a signed type makes it lose half its capacity, but that's only a concern for 32-bit systems and LONG_MAX * PAGE_SIZE is 8T on x86 in that case, so there's headroom. Now that mmap_sem is not taken as writer here, ktask threads holding mmap_sem as reader can run more often. Performance results appear later in the series. On powerpc, this was cross-compiled-tested only. [XXX Can send separately.] Signed-off-by: Daniel Jordan --- arch/powerpc/kvm/book3s_64_vio.c | 15 ++++++++------- arch/powerpc/mm/mmu_context_iommu.c | 16 ++++++++-------- drivers/fpga/dfl-afu-dma-region.c | 16 +++++++++------- drivers/vfio/vfio_iommu_spapr_tce.c | 14 +++++++------- drivers/vfio/vfio_iommu_type1.c | 11 ++++++----- fs/proc/task_mmu.c | 2 +- include/linux/mm_types.h | 2 +- kernel/fork.c | 2 +- mm/debug.c | 3 ++- mm/mlock.c | 4 ++-- mm/mmap.c | 18 +++++++++--------- mm/mremap.c | 6 +++--- 12 files changed, 57 insertions(+), 52 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index 62a8d03ba7e9..b5637de6dde5 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -58,33 +58,34 @@ static unsigned long kvmppc_stt_pages(unsigned long tce_pages) static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc) { - long ret = 0; + long locked_vm, ret = 0; if (!current || !current->mm) return ret; /* process exited */ down_write(¤t->mm->mmap_sem); + locked_vm = atomic_long_read(¤t->mm->locked_vm); if (inc) { unsigned long locked, lock_limit; - locked = current->mm->locked_vm + stt_pages; + locked = locked_vm + stt_pages; lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; if (locked > lock_limit && !capable(CAP_IPC_LOCK)) ret = -ENOMEM; else - current->mm->locked_vm += stt_pages; + atomic_long_add(stt_pages, ¤t->mm->locked_vm); } else { - if (WARN_ON_ONCE(stt_pages > current->mm->locked_vm)) - stt_pages = current->mm->locked_vm; + if (WARN_ON_ONCE(stt_pages > locked_vm)) + stt_pages = locked_vm; - current->mm->locked_vm -= stt_pages; + atomic_long_sub(stt_pages, ¤t->mm->locked_vm); } pr_debug("[%d] RLIMIT_MEMLOCK KVM %c%ld %ld/%ld%s\n", current->pid, inc ? '+' : '-', stt_pages << PAGE_SHIFT, - current->mm->locked_vm << PAGE_SHIFT, + atomic_long_read(¤t->mm->locked_vm) << PAGE_SHIFT, rlimit(RLIMIT_MEMLOCK), ret ? 
" - exceeded" : ""); diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c index 56c2234cc6ae..a8f66975bf53 100644 --- a/arch/powerpc/mm/mmu_context_iommu.c +++ b/arch/powerpc/mm/mmu_context_iommu.c @@ -41,31 +41,31 @@ struct mm_iommu_table_group_mem_t { static long mm_iommu_adjust_locked_vm(struct mm_struct *mm, unsigned long npages, bool incr) { - long ret = 0, locked, lock_limit; + long ret = 0, locked, lock_limit, locked_vm; if (!npages) return 0; down_write(&mm->mmap_sem); - + locked_vm = atomic_long_read(&mm->locked_vm); if (incr) { - locked = mm->locked_vm + npages; + locked = locked_vm + npages; lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; if (locked > lock_limit && !capable(CAP_IPC_LOCK)) ret = -ENOMEM; else - mm->locked_vm += npages; + atomic_long_add(npages, &mm->locked_vm); } else { - if (WARN_ON_ONCE(npages > mm->locked_vm)) - npages = mm->locked_vm; - mm->locked_vm -= npages; + if (WARN_ON_ONCE(npages > locked_vm)) + npages = locked_vm; + atomic_long_sub(npages, &mm->locked_vm); } pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n", current ? current->pid : 0, incr ? '+' : '-', npages << PAGE_SHIFT, - mm->locked_vm << PAGE_SHIFT, + atomic_long_read(&mm->locked_vm) << PAGE_SHIFT, rlimit(RLIMIT_MEMLOCK)); up_write(&mm->mmap_sem); diff --git a/drivers/fpga/dfl-afu-dma-region.c b/drivers/fpga/dfl-afu-dma-region.c index 025aba3ea76c..1a7939c511a0 100644 --- a/drivers/fpga/dfl-afu-dma-region.c +++ b/drivers/fpga/dfl-afu-dma-region.c @@ -45,6 +45,7 @@ void afu_dma_region_init(struct dfl_feature_platform_data *pdata) static int afu_dma_adjust_locked_vm(struct device *dev, long npages, bool incr) { unsigned long locked, lock_limit; + long locked_vm; int ret = 0; /* the task is exiting. */ @@ -53,24 +54,25 @@ static int afu_dma_adjust_locked_vm(struct device *dev, long npages, bool incr) down_write(¤t->mm->mmap_sem); + locked_vm = atomic_long_read(¤t->mm->locked_vm); if (incr) { - locked = current->mm->locked_vm + npages; + locked = locked_vm + npages; lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; if (locked > lock_limit && !capable(CAP_IPC_LOCK)) ret = -ENOMEM; else - current->mm->locked_vm += npages; + atomic_long_add(npages, ¤t->mm->locked_vm); } else { - if (WARN_ON_ONCE(npages > current->mm->locked_vm)) - npages = current->mm->locked_vm; - current->mm->locked_vm -= npages; + if (WARN_ON_ONCE(npages > locked_vm)) + npages = locked_vm; + atomic_long_sub(npages, ¤t->mm->locked_vm); } dev_dbg(dev, "[%d] RLIMIT_MEMLOCK %c%ld %ld/%ld%s\n", current->pid, incr ? '+' : '-', npages << PAGE_SHIFT, - current->mm->locked_vm << PAGE_SHIFT, rlimit(RLIMIT_MEMLOCK), - ret ? "- exceeded" : ""); + atomic_long_read(¤t->mm->locked_vm) << PAGE_SHIFT, + rlimit(RLIMIT_MEMLOCK), ret ? 
"- exceeded" : ""); up_write(¤t->mm->mmap_sem); diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index b30926e11d87..c834bc2d1e6b 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -45,16 +45,16 @@ static long try_increment_locked_vm(struct mm_struct *mm, long npages) return 0; down_write(&mm->mmap_sem); - locked = mm->locked_vm + npages; + locked = atomic_long_read(&mm->locked_vm) + npages; lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; if (locked > lock_limit && !capable(CAP_IPC_LOCK)) ret = -ENOMEM; else - mm->locked_vm += npages; + atomic_long_add(npages, &mm->locked_vm); pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid, npages << PAGE_SHIFT, - mm->locked_vm << PAGE_SHIFT, + atomic_long_read(&mm->locked_vm) << PAGE_SHIFT, rlimit(RLIMIT_MEMLOCK), ret ? " - exceeded" : ""); @@ -69,12 +69,12 @@ static void decrement_locked_vm(struct mm_struct *mm, long npages) return; down_write(&mm->mmap_sem); - if (WARN_ON_ONCE(npages > mm->locked_vm)) - npages = mm->locked_vm; - mm->locked_vm -= npages; + if (WARN_ON_ONCE(npages > atomic_long_read(&mm->locked_vm))) + npages = atomic_long_read(&mm->locked_vm); + atomic_long_sub(npages, &mm->locked_vm); pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid, npages << PAGE_SHIFT, - mm->locked_vm << PAGE_SHIFT, + atomic_long_read(&mm->locked_vm) << PAGE_SHIFT, rlimit(RLIMIT_MEMLOCK)); up_write(&mm->mmap_sem); } diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index e7cfbf0c8071..f307dc9d5e19 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -276,13 +276,13 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async) limit = task_rlimit(dma->task, RLIMIT_MEMLOCK) >> PAGE_SHIFT; - if (mm->locked_vm + npage > limit) + if (atomic_long_read(&mm->locked_vm) + npage > limit) ret = -ENOMEM; } } if (!ret) - mm->locked_vm += npage; + atomic_long_add(npage, &mm->locked_vm); up_write(&mm->mmap_sem); } @@ -419,7 +419,8 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, * pages are already counted against the user. 
*/ if (!rsvd && !vfio_find_vpfn(dma, iova)) { - if (!dma->lock_cap && mm->locked_vm + 1 > limit) { + if (!dma->lock_cap && + atomic_long_read(&mm->locked_vm) + 1 > limit) { put_pfn(*pfn_base, dma->prot); pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__, limit << PAGE_SHIFT); @@ -445,8 +446,8 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, } if (!rsvd && !vfio_find_vpfn(dma, iova)) { - if (!dma->lock_cap && - mm->locked_vm + lock_acct + 1 > limit) { + if (!dma->lock_cap && atomic_long_read(&mm->locked_vm) + + lock_acct + 1 > limit) { put_pfn(pfn, dma->prot); pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__, limit << PAGE_SHIFT); diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 39e96a21366e..f0468bfe022e 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -58,7 +58,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) swap = get_mm_counter(mm, MM_SWAPENTS); SEQ_PUT_DEC("VmPeak:\t", hiwater_vm); SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm); - SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm); + SEQ_PUT_DEC(" kB\nVmLck:\t", atomic_long_read(&mm->locked_vm)); SEQ_PUT_DEC(" kB\nVmPin:\t", mm->pinned_vm); SEQ_PUT_DEC(" kB\nVmHWM:\t", hiwater_rss); SEQ_PUT_DEC(" kB\nVmRSS:\t", total_rss); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 5ed8f6292a53..ee02933d10e5 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -399,7 +399,7 @@ struct mm_struct { unsigned long hiwater_vm; /* High-water virtual memory usage */ unsigned long total_vm; /* Total pages mapped */ - unsigned long locked_vm; /* Pages that have PG_mlocked set */ + atomic_long_t locked_vm; /* Pages that have PG_mlocked set */ unsigned long pinned_vm; /* Refcount permanently increased */ unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */ unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE & ~VM_STACK */ diff --git a/kernel/fork.c b/kernel/fork.c index 2f78d32eaa0f..e0d4d5b8f151 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -978,7 +978,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, mm->core_state = NULL; mm_pgtables_bytes_init(mm); mm->map_count = 0; - mm->locked_vm = 0; + atomic_long_set(&mm->locked_vm, 0); mm->pinned_vm = 0; memset(&mm->rss_stat, 0, sizeof(mm->rss_stat)); spin_lock_init(&mm->page_table_lock); diff --git a/mm/debug.c b/mm/debug.c index cdacba12e09a..3d8db069176f 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -152,7 +152,8 @@ void dump_mm(const struct mm_struct *mm) atomic_read(&mm->mm_count), mm_pgtables_bytes(mm), mm->map_count, - mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm, + mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, + atomic_long_read(&mm->locked_vm), mm->pinned_vm, mm->data_vm, mm->exec_vm, mm->stack_vm, mm->start_code, mm->end_code, mm->start_data, mm->end_data, mm->start_brk, mm->brk, mm->start_stack, diff --git a/mm/mlock.c b/mm/mlock.c index 41cc47e28ad6..b213ad97585c 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -562,7 +562,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev, nr_pages = -nr_pages; else if (old_flags & VM_LOCKED) nr_pages = 0; - mm->locked_vm += nr_pages; + atomic_long_add(nr_pages, &mm->locked_vm); /* * vm_flags is protected by the mmap_sem held in write mode. 
@@ -687,7 +687,7 @@ static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t fla if (down_write_killable(¤t->mm->mmap_sem)) return -EINTR; - locked += current->mm->locked_vm; + locked += atomic_long_read(¤t->mm->locked_vm); if ((locked > lock_limit) && (!capable(CAP_IPC_LOCK))) { /* * It is possible that the regions requested intersect with diff --git a/mm/mmap.c b/mm/mmap.c index 6c04292e16a7..5c4b39b99fb4 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1339,7 +1339,7 @@ static inline int mlock_future_check(struct mm_struct *mm, /* mlock MCL_FUTURE? */ if (flags & VM_LOCKED) { locked = len >> PAGE_SHIFT; - locked += mm->locked_vm; + locked += atomic_long_read(&mm->locked_vm); lock_limit = rlimit(RLIMIT_MEMLOCK); lock_limit >>= PAGE_SHIFT; if (locked > lock_limit && !capable(CAP_IPC_LOCK)) @@ -1825,7 +1825,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, vma == get_gate_vma(current->mm)) vma->vm_flags &= VM_LOCKED_CLEAR_MASK; else - mm->locked_vm += (len >> PAGE_SHIFT); + atomic_long_add(len >> PAGE_SHIFT, &mm->locked_vm); } if (file) @@ -2291,7 +2291,7 @@ static int acct_stack_growth(struct vm_area_struct *vma, if (vma->vm_flags & VM_LOCKED) { unsigned long locked; unsigned long limit; - locked = mm->locked_vm + grow; + locked = atomic_long_read(&mm->locked_vm) + grow; limit = rlimit(RLIMIT_MEMLOCK); limit >>= PAGE_SHIFT; if (locked > limit && !capable(CAP_IPC_LOCK)) @@ -2385,7 +2385,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address) */ spin_lock(&mm->page_table_lock); if (vma->vm_flags & VM_LOCKED) - mm->locked_vm += grow; + atomic_long_add(grow, &mm->locked_vm); vm_stat_account(mm, vma->vm_flags, grow); anon_vma_interval_tree_pre_update_vma(vma); vma->vm_end = address; @@ -2466,7 +2466,7 @@ int expand_downwards(struct vm_area_struct *vma, */ spin_lock(&mm->page_table_lock); if (vma->vm_flags & VM_LOCKED) - mm->locked_vm += grow; + atomic_long_add(grow, &mm->locked_vm); vm_stat_account(mm, vma->vm_flags, grow); anon_vma_interval_tree_pre_update_vma(vma); vma->vm_start = address; @@ -2787,11 +2787,11 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len, /* * unlock any mlock()ed ranges before detaching vmas */ - if (mm->locked_vm) { + if (atomic_long_read(&mm->locked_vm)) { struct vm_area_struct *tmp = vma; while (tmp && tmp->vm_start < end) { if (tmp->vm_flags & VM_LOCKED) { - mm->locked_vm -= vma_pages(tmp); + atomic_long_sub(vma_pages(tmp), &mm->locked_vm); munlock_vma_pages_all(tmp); } @@ -3050,7 +3050,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla mm->total_vm += len >> PAGE_SHIFT; mm->data_vm += len >> PAGE_SHIFT; if (flags & VM_LOCKED) - mm->locked_vm += (len >> PAGE_SHIFT); + atomic_long_add(len >> PAGE_SHIFT, &mm->locked_vm); vma->vm_flags |= VM_SOFTDIRTY; return 0; } @@ -3122,7 +3122,7 @@ void exit_mmap(struct mm_struct *mm) up_write(&mm->mmap_sem); } - if (mm->locked_vm) { + if (atomic_long_read(&mm->locked_vm)) { vma = mm->mmap; while (vma) { if (vma->vm_flags & VM_LOCKED) diff --git a/mm/mremap.c b/mm/mremap.c index 7f9f9180e401..abcc58690f17 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -360,7 +360,7 @@ static unsigned long move_vma(struct vm_area_struct *vma, } if (vm_flags & VM_LOCKED) { - mm->locked_vm += new_len >> PAGE_SHIFT; + atomic_long_add(new_len >> PAGE_SHIFT, &mm->locked_vm); *locked = true; } @@ -411,7 +411,7 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr, if (vma->vm_flags & VM_LOCKED) { unsigned long locked, lock_limit; - 
locked = mm->locked_vm << PAGE_SHIFT; + locked = atomic_long_read(&mm->locked_vm) << PAGE_SHIFT; lock_limit = rlimit(RLIMIT_MEMLOCK); locked += new_len - old_len; if (locked > lock_limit && !capable(CAP_IPC_LOCK)) @@ -600,7 +600,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len, vm_stat_account(mm, vma->vm_flags, pages); if (vma->vm_flags & VM_LOCKED) { - mm->locked_vm += pages; + atomic_long_add(pages, &mm->locked_vm); locked = true; new_addr = addr; } From patchwork Mon Nov 5 16:55:53 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Daniel Jordan X-Patchwork-Id: 10668721 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 439F314BD for ; Mon, 5 Nov 2018 16:59:08 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 3212E29B8D for ; Mon, 5 Nov 2018 16:59:08 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 236F829BA0; Mon, 5 Nov 2018 16:59:08 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id CB7FE29B8D for ; Mon, 5 Nov 2018 16:59:07 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387773AbeKFCTf (ORCPT ); Mon, 5 Nov 2018 21:19:35 -0500 Received: from userp2120.oracle.com ([156.151.31.85]:43552 "EHLO userp2120.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2387678AbeKFCR1 (ORCPT ); Mon, 5 Nov 2018 21:17:27 -0500 Received: from pps.filterd (userp2120.oracle.com [127.0.0.1]) by userp2120.oracle.com (8.16.0.22/8.16.0.22) with SMTP id wA5Gs4bK052321; Mon, 5 Nov 2018 16:56:26 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=corp-2018-07-02; bh=pFJTplxowb9ZeUGTgNoJLdHdBNDs48jVBVOWfuQrUbk=; b=UEDXi9JQkI61IrVjTEL9K2GK2kUyKmRXKt4ss+WNs6+EK0pNVLSyhLeGQntyiJM5cEIY 1QTs2L+45uBq0NzchVRwPJe6uRKAwuGgqm6hOlQix9uyUSXu592okWo3iSWYTsl/qaw3 8mNuLrR1FcUPpxeOoXm4mDjSV9FxagWh8aPyOdKt8PLonHdEZfXgPE4AltLy3UCoTOeK jUog8/C/ZgDbE8JAsDhSMmGritc7Bt190iR/MqdkBatkrFyrFLNaYl5BAZlh0Py/U64X SJ8E5YAvEUlYnM8jGUjYCG6xNTBhW9MI0wXB7jXzKedma3SvDo9hFnWSZZv5KHpQC47Q JQ== Received: from userv0022.oracle.com (userv0022.oracle.com [156.151.31.74]) by userp2120.oracle.com with ESMTP id 2nh4aqg2e3-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 05 Nov 2018 16:56:26 +0000 Received: from userv0121.oracle.com (userv0121.oracle.com [156.151.31.72]) by userv0022.oracle.com (8.14.4/8.14.4) with ESMTP id wA5GuKSY024985 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 5 Nov 2018 16:56:20 GMT Received: from abhmp0006.oracle.com (abhmp0006.oracle.com [141.146.116.12]) by userv0121.oracle.com (8.14.4/8.13.8) with ESMTP id wA5GuJFq008435; Mon, 5 Nov 2018 16:56:19 GMT Received: from localhost.localdomain (/73.60.114.248) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Mon, 05 Nov 2018 
08:56:19 -0800 From: Daniel Jordan To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz Subject: [RFC PATCH v4 08/13] vfio: remove unnecessary mmap_sem writer acquisition around locked_vm Date: Mon, 5 Nov 2018 11:55:53 -0500 Message-Id: <20181105165558.11698-9-daniel.m.jordan@oracle.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com> References: <20181105165558.11698-1-daniel.m.jordan@oracle.com> MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9068 signatures=668683 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=0 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1807170000 definitions=main-1811050153 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Now that mmap_sem is no longer required for modifying locked_vm, remove it in the VFIO code. [XXX Can be sent separately, along with similar conversions in the other places mmap_sem was taken for locked_vm. While at it, could make similar changes to pinned_vm.] Signed-off-by: Daniel Jordan --- drivers/vfio/vfio_iommu_type1.c | 26 +++++++++----------------- 1 file changed, 9 insertions(+), 17 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index f307dc9d5e19..9e52a24eb45b 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -258,7 +258,8 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn) static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async) { struct mm_struct *mm; - int ret; + long locked_vm; + int ret = 0; if (!npage) return 0; @@ -267,24 +268,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async) if (!mm) return -ESRCH; /* process exited */ - ret = down_write_killable(&mm->mmap_sem); - if (!ret) { - if (npage > 0) { - if (!dma->lock_cap) { - unsigned long limit; - - limit = task_rlimit(dma->task, - RLIMIT_MEMLOCK) >> PAGE_SHIFT; + locked_vm = atomic_long_add_return(npage, &mm->locked_vm); - if (atomic_long_read(&mm->locked_vm) + npage > limit) - ret = -ENOMEM; - } + if (npage > 0 && !dma->lock_cap) { + unsigned long limit = task_rlimit(dma->task, RLIMIT_MEMLOCK) >> + PAGE_SHIFT; + if (locked_vm > limit) { + atomic_long_sub(npage, &mm->locked_vm); + ret = -ENOMEM; } - - if (!ret) - atomic_long_add(npage, &mm->locked_vm); - - up_write(&mm->mmap_sem); } if (async) From patchwork Mon Nov 5 16:55:54 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Daniel Jordan X-Patchwork-Id: 10668725 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 20CD814E2 for ; Mon, 5 Nov 2018 16:59:16 +0000 (UTC) Received: from mail.wl.linuxfoundation.org 
From: Daniel Jordan
To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz
Subject: [RFC PATCH v4 09/13] vfio: relieve mmap_sem reader cacheline bouncing by holding it longer
Date: Mon, 5 Nov 2018 11:55:54 -0500
Message-Id: <20181105165558.11698-10-daniel.m.jordan@oracle.com>
In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com>
References:
<20181105165558.11698-1-daniel.m.jordan@oracle.com>

Profiling shows significant time being spent on atomic ops in mmap_sem reader acquisition. mmap_sem is taken and dropped for every single base page during pinning, so this is not surprising. Reduce the number of times mmap_sem is taken by holding it longer, which relieves atomic cacheline bouncing.

Results for all VFIO page pinning patches
-----------------------------------------

The test measures the time from qemu invocation to the start of guest boot. The guest uses kvm with 320G memory backed with THP. 320G fits in a node on the test machine used here, so there was no thrashing in reclaim because of __GFP_THISNODE in THP allocations[1].

CPU:               2 nodes * 24 cores/node * 2 threads/core = 96 CPUs
                   Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
memory:            754G split evenly between nodes
scaling_governor:  performance

             patch 6                   patch 8                   patch 9 (this one)
      -----------------------   ------------------------   ------------------------
 thr  speedup  average sec      speedup  average sec       speedup  average sec
  1            65.0 (± 0.6%)             65.2 (± 0.5%)              65.5 (± 0.4%)
  2    1.5x    42.8 (± 5.8%)     1.8x    36.0 (± 0.9%)      1.9x    34.4 (± 0.3%)
  3    1.9x    35.0 (±11.3%)     2.5x    26.4 (± 4.2%)      2.8x    23.7 (± 0.2%)
  4    2.3x    28.5 (± 1.3%)     3.1x    21.2 (± 2.8%)      3.6x    18.3 (± 0.3%)
  5    2.5x    26.2 (± 1.5%)     3.6x    17.9 (± 0.9%)      4.3x    15.1 (± 0.3%)
  6    2.7x    24.5 (± 1.8%)     4.0x    16.5 (± 3.0%)      5.1x    12.9 (± 0.1%)
  7    2.8x    23.5 (± 4.9%)     4.2x    15.4 (± 2.7%)      5.7x    11.5 (± 0.6%)
  8    2.8x    22.8 (± 1.8%)     4.2x    15.5 (± 4.7%)      6.4x    10.3 (± 0.8%)
 12    3.2x    20.2 (± 1.4%)     4.4x    14.7 (± 2.9%)      8.6x     7.6 (± 0.6%)
 16    3.3x    20.0 (± 0.7%)     4.3x    15.4 (± 1.3%)     10.2x     6.4 (± 0.6%)

At patch 6, lock_stat showed long reader wait time on mmap_sem writers, leading to patch 8. At patch 8, profiling revealed the issue with mmap_sem described above. Across all three patches, performance consistently improves as the thread count increases. The one exception is the antiscaling with nthr=16 in patch 8: those mmap_sem atomics are really bouncing around the machine.

The performance with patch 9 looks pretty good overall. I'm working on finding the next bottleneck, and this is where it stopped: when nthr=16, the obvious issue profiling showed was contention on the split PMD page table lock when pages are faulted in during the pinning (>2% of the time). A split PMD lock protects a PUD_SIZE-ed amount of page table mappings (1G on x86), so if threads were operating on smaller chunks and contending in the same PUD_SIZE range, this could be the source of contention.
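As an aside, stripped of the VFIO details in the diff that follows, the locking change being measured boils down to the pattern below. This is only an illustrative sketch (pin_batch_before()/pin_batch_after() are made-up names and the per-page work is elided); the real patch threads a handle_mmap_sem flag through vaddr_get_pfn() so the caller can hold the semaphore across the whole batch.

	/*
	 * Illustrative sketch only -- not the actual VFIO functions.
	 * Before: mmap_sem is taken and dropped once per base page, so the
	 * rwsem's atomic counter cacheline bounces between CPUs.
	 */
	static long pin_batch_before(struct mm_struct *mm, unsigned long vaddr,
				     long npage)
	{
		long i;

		for (i = 0; i < npage; i++, vaddr += PAGE_SIZE) {
			down_read(&mm->mmap_sem);
			/* ... pin one page with get_user_pages_longterm() ... */
			up_read(&mm->mmap_sem);
		}
		return i;
	}

	/*
	 * After: the caller acquires mmap_sem once and holds it for the whole
	 * batch, so the per-page cost is just the pinning work itself.
	 */
	static long pin_batch_after(struct mm_struct *mm, unsigned long vaddr,
				    long npage)
	{
		long i;

		down_read(&mm->mmap_sem);
		for (i = 0; i < npage; i++, vaddr += PAGE_SIZE) {
			/* ... pin one page with get_user_pages_longterm() ... */
		}
		up_read(&mm->mmap_sem);
		return i;
	}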
However, when nthr=16, threads operate on 5G chunks (320G / 16 threads / (1< --- drivers/vfio/vfio_iommu_type1.c | 28 +++++++++++++++++++--------- 1 file changed, 19 insertions(+), 9 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 9e52a24eb45b..0d6ec0369de5 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -331,7 +331,7 @@ static int put_pfn(unsigned long pfn, int prot) } static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, - int prot, unsigned long *pfn) + int prot, unsigned long *pfn, bool handle_mmap_sem) { struct page *page[1]; struct vm_area_struct *vma; @@ -342,7 +342,8 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, if (prot & IOMMU_WRITE) flags |= FOLL_WRITE; - down_read(&mm->mmap_sem); + if (handle_mmap_sem) + down_read(&mm->mmap_sem); if (mm == current->mm) { ret = get_user_pages_longterm(vaddr, 1, flags, page, vmas); } else { @@ -360,14 +361,16 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, put_page(page[0]); } } - up_read(&mm->mmap_sem); + if (handle_mmap_sem) + up_read(&mm->mmap_sem); if (ret == 1) { *pfn = page_to_pfn(page[0]); return 0; } - down_read(&mm->mmap_sem); + if (handle_mmap_sem) + down_read(&mm->mmap_sem); vma = find_vma_intersection(mm, vaddr, vaddr + 1); @@ -377,7 +380,8 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, ret = 0; } - up_read(&mm->mmap_sem); + if (handle_mmap_sem) + up_read(&mm->mmap_sem); return ret; } @@ -399,9 +403,12 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, if (!mm) return -ENODEV; - ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base); - if (ret) + down_read(&mm->mmap_sem); + ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base, false); + if (ret) { + up_read(&mm->mmap_sem); return ret; + } pinned++; rsvd = is_invalid_reserved_pfn(*pfn_base); @@ -416,6 +423,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, put_pfn(*pfn_base, dma->prot); pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__, limit << PAGE_SHIFT); + up_read(&mm->mmap_sem); return -ENOMEM; } lock_acct++; @@ -427,7 +435,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, /* Lock all the consecutive pages from pfn_base */ for (vaddr += PAGE_SIZE, iova += PAGE_SIZE; pinned < npage; pinned++, vaddr += PAGE_SIZE, iova += PAGE_SIZE) { - ret = vaddr_get_pfn(mm, vaddr, dma->prot, &pfn); + ret = vaddr_get_pfn(mm, vaddr, dma->prot, &pfn, false); if (ret) break; @@ -444,6 +452,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__, limit << PAGE_SHIFT); ret = -ENOMEM; + up_read(&mm->mmap_sem); goto unpin_out; } lock_acct++; @@ -451,6 +460,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, } out: + up_read(&mm->mmap_sem); ret = vfio_lock_acct(dma, lock_acct, false); unpin_out: @@ -497,7 +507,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr, if (!mm) return -ENODEV; - ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base); + ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base, true); if (!ret && do_accounting && !is_invalid_reserved_pfn(*pfn_base)) { ret = vfio_lock_acct(dma, 1, true); if (ret) { From patchwork Mon Nov 5 16:55:55 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Daniel Jordan X-Patchwork-Id: 10668669 
From: Daniel Jordan
To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz
Subject: [RFC PATCH v4 10/13] mm: enlarge type of offset argument in mem_map_offset and mem_map_next
Date: Mon, 5 Nov 2018 11:55:55 -0500 Message-Id: <20181105165558.11698-11-daniel.m.jordan@oracle.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com> References: <20181105165558.11698-1-daniel.m.jordan@oracle.com> MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9068 signatures=668683 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=0 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=778 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1807170000 definitions=main-1811050153 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Changes the type of 'offset' from int to unsigned long in both mem_map_offset and mem_map_next. This facilitates ktask's use of mem_map_next with its unsigned long types to avoid silent truncation when these unsigned longs are passed as ints. It also fixes the preexisting truncation of 'offset' from unsigned long to int by the sole caller of mem_map_offset, follow_hugetlb_page. Signed-off-by: Daniel Jordan --- mm/internal.h | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 3b1ec1412fd2..cc90de4d4e01 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -367,7 +367,8 @@ static inline void mlock_migrate_page(struct page *new, struct page *old) { } * the maximally aligned gigantic page 'base'. Handle any discontiguity * in the mem_map at MAX_ORDER_NR_PAGES boundaries. */ -static inline struct page *mem_map_offset(struct page *base, int offset) +static inline struct page *mem_map_offset(struct page *base, + unsigned long offset) { if (unlikely(offset >= MAX_ORDER_NR_PAGES)) return nth_page(base, offset); @@ -378,8 +379,8 @@ static inline struct page *mem_map_offset(struct page *base, int offset) * Iterator over all subpages within the maximally aligned gigantic * page 'base'. Handle any discontiguity in the mem_map. 
*/ -static inline struct page *mem_map_next(struct page *iter, - struct page *base, int offset) +static inline struct page *mem_map_next(struct page *iter, struct page *base, + unsigned long offset) { if (unlikely((offset & (MAX_ORDER_NR_PAGES - 1)) == 0)) { unsigned long pfn = page_to_pfn(base) + offset; From patchwork Mon Nov 5 16:55:56 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Daniel Jordan X-Patchwork-Id: 10668733 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E412414BD for ; Mon, 5 Nov 2018 16:59:32 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D14E929B8D for ; Mon, 5 Nov 2018 16:59:32 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id C48DA29BA0; Mon, 5 Nov 2018 16:59:32 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 2047229B8D for ; Mon, 5 Nov 2018 16:59:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387796AbeKFCT7 (ORCPT ); Mon, 5 Nov 2018 21:19:59 -0500 Received: from userp2130.oracle.com ([156.151.31.86]:38800 "EHLO userp2130.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2387643AbeKFCRZ (ORCPT ); Mon, 5 Nov 2018 21:17:25 -0500 Received: from pps.filterd (userp2130.oracle.com [127.0.0.1]) by userp2130.oracle.com (8.16.0.22/8.16.0.22) with SMTP id wA5Gs5AK097190; Mon, 5 Nov 2018 16:56:26 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=corp-2018-07-02; bh=gmXHeNTucBAhP5oOOIcBjLrTNKwY0b2G6CBWPMLTzgY=; b=U8hcULPwR1YXjpnkRNNkZpVOXWU/iuEeoayHLOibvlpvwD2FGOJnGTMgEgA5eTw1DbsH 1S6pGD1I2f50NUSo032DTf3iDHCtq116VySLP+wgy5MjHt5+noiydGbdMavzGqq6MQHe khYC4oYTcWvlJkb2d4WE2BEx3tnHsdBwDVlN8a6vju9tG7lNuSZ9TLqPe//nwRkAwKaM mSqOzUl3Pi7Jzo+qMFqcxrVvI35mTOfRCXOLSizTZjsh7r4Aqbk8nKZmfp/xcRAwDglj 8GZuCo4rFzXVnqeINsOea8+2fWI6VY1o3TW3Qr5bMe6CM8addCfcPxvy/ONPJaUuLqLc 2A== Received: from userv0021.oracle.com (userv0021.oracle.com [156.151.31.71]) by userp2130.oracle.com with ESMTP id 2nh33tr6bv-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 05 Nov 2018 16:56:26 +0000 Received: from aserv0121.oracle.com (aserv0121.oracle.com [141.146.126.235]) by userv0021.oracle.com (8.14.4/8.14.4) with ESMTP id wA5GuPw2005090 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 5 Nov 2018 16:56:25 GMT Received: from abhmp0006.oracle.com (abhmp0006.oracle.com [141.146.116.12]) by aserv0121.oracle.com (8.14.4/8.13.8) with ESMTP id wA5GuP8O018749; Mon, 5 Nov 2018 16:56:25 GMT Received: from localhost.localdomain (/73.60.114.248) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Mon, 05 Nov 2018 08:56:25 -0800 From: Daniel Jordan To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org Cc: aarcange@redhat.com, aaron.lu@intel.com, 
akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node Date: Mon, 5 Nov 2018 11:55:56 -0500 Message-Id: <20181105165558.11698-12-daniel.m.jordan@oracle.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com> References: <20181105165558.11698-1-daniel.m.jordan@oracle.com> MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9068 signatures=668683 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=2 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1807170000 definitions=main-1811050153 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Deferred struct page initialization currently runs one thread per node, but this is a bottleneck during boot on big machines, so use ktask within each pgdatinit thread to parallelize the struct page initialization, allowing the system to take better advantage of its memory bandwidth. Because the system is not fully up yet and most CPUs are idle, use more than the default maximum number of ktask threads. The kernel doesn't know the memory bandwidth of a given system to get the most efficient number of threads, so there's some guesswork involved. In testing, a reasonable value turned out to be about a quarter of the CPUs on the node. __free_pages_core used to increase the zone's managed page count by the number of pages being freed. To accommodate multiple threads, however, account the number of freed pages with an atomic shared across the ktask threads and bump the managed page count with it after ktask is finished. 
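Condensed from the diff below, the accounting pattern is: each ktask chunk callback adds the number of pages it handled to an atomic64_t shared through the argument struct, and only the single pgdatinit thread touches zone->managed_pages once ktask has finished. A minimal sketch of that pattern, using the names the patch itself introduces:

	struct deferred_args {
		int nid;
		int zid;
		atomic64_t nr_pages;	/* shared by all ktask worker threads */
	};

	static int __init deferred_init_chunk(unsigned long pfn,
					      unsigned long end_pfn,
					      struct deferred_args *args)
	{
		unsigned long nr_pages = deferred_init_pages(args->nid, args->zid,
							     pfn, end_pfn);

		/* lockless accumulation across threads */
		atomic64_add(nr_pages, &args->nr_pages);
		return KTASK_RETURN_SUCCESS;
	}

	/*
	 * After ktask_run_numa() returns, only the pgdatinit thread is still
	 * running, so the total can be applied to the zone without further
	 * synchronization:
	 *
	 *	nr_free += atomic64_read(&args.nr_pages);
	 *	...
	 *	zone->managed_pages += nr_free;
	 */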
Test: Boot the machine with deferred struct page init three times Machine: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, 88 CPUs, 503G memory, 2 sockets kernel speedup max time per stdev node (ms) baseline (4.15-rc2) 5860 8.6 ktask 9.56x 613 12.4 --- Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory 8 sockets kernel speedup max time per stdev node (ms) baseline (4.15-rc2) 1261 1.9 ktask 3.88x 325 5.0 Signed-off-by: Daniel Jordan Suggested-by: Pavel Tatashin --- mm/page_alloc.c | 91 ++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 78 insertions(+), 13 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ae31839874b8..fe7b681567ba 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -66,6 +66,7 @@ #include #include #include +#include #include #include @@ -1275,7 +1276,6 @@ void __free_pages_core(struct page *page, unsigned int order) set_page_count(p, 0); } - page_zone(page)->managed_pages += nr_pages; set_page_refcounted(page); __free_pages(page, order); } @@ -1340,6 +1340,7 @@ void __init memblock_free_pages(struct page *page, unsigned long pfn, if (early_page_uninitialised(pfn)) return; __free_pages_core(page, order); + page_zone(page)->managed_pages += 1UL << order; } /* @@ -1477,23 +1478,31 @@ deferred_pfn_valid(int nid, unsigned long pfn, return true; } +struct deferred_args { + int nid; + int zid; + atomic64_t nr_pages; +}; + /* * Free pages to buddy allocator. Try to free aligned pages in * pageblock_nr_pages sizes. */ -static void __init deferred_free_pages(int nid, int zid, unsigned long pfn, - unsigned long end_pfn) +static int __init deferred_free_pages(int nid, int zid, unsigned long pfn, + unsigned long end_pfn) { struct mminit_pfnnid_cache nid_init_state = { }; unsigned long nr_pgmask = pageblock_nr_pages - 1; - unsigned long nr_free = 0; + unsigned long nr_free = 0, nr_pages = 0; for (; pfn < end_pfn; pfn++) { if (!deferred_pfn_valid(nid, pfn, &nid_init_state)) { deferred_free_range(pfn - nr_free, nr_free); + nr_pages += nr_free; nr_free = 0; } else if (!(pfn & nr_pgmask)) { deferred_free_range(pfn - nr_free, nr_free); + nr_pages += nr_free; nr_free = 1; touch_nmi_watchdog(); } else { @@ -1502,16 +1511,27 @@ static void __init deferred_free_pages(int nid, int zid, unsigned long pfn, } /* Free the last block of pages to allocator */ deferred_free_range(pfn - nr_free, nr_free); + nr_pages += nr_free; + + return nr_pages; +} + +static int __init deferred_free_chunk(unsigned long pfn, unsigned long end_pfn, + struct deferred_args *args) +{ + unsigned long nr_pages = deferred_free_pages(args->nid, args->zid, pfn, + end_pfn); + atomic64_add(nr_pages, &args->nr_pages); + return KTASK_RETURN_SUCCESS; } /* * Initialize struct pages. We minimize pfn page lookups and scheduler checks * by performing it only once every pageblock_nr_pages. - * Return number of pages initialized. + * Return number of pages initialized in deferred_args. 
*/ -static unsigned long __init deferred_init_pages(int nid, int zid, - unsigned long pfn, - unsigned long end_pfn) +static int __init deferred_init_pages(int nid, int zid, unsigned long pfn, + unsigned long end_pfn) { struct mminit_pfnnid_cache nid_init_state = { }; unsigned long nr_pgmask = pageblock_nr_pages - 1; @@ -1531,7 +1551,17 @@ static unsigned long __init deferred_init_pages(int nid, int zid, __init_single_page(page, pfn, zid, nid); nr_pages++; } - return (nr_pages); + + return nr_pages; +} + +static int __init deferred_init_chunk(unsigned long pfn, unsigned long end_pfn, + struct deferred_args *args) +{ + unsigned long nr_pages = deferred_init_pages(args->nid, args->zid, pfn, + end_pfn); + atomic64_add(nr_pages, &args->nr_pages); + return KTASK_RETURN_SUCCESS; } /* Initialise remaining memory on a node */ @@ -1540,13 +1570,15 @@ static int __init deferred_init_memmap(void *data) pg_data_t *pgdat = data; int nid = pgdat->node_id; unsigned long start = jiffies; - unsigned long nr_pages = 0; + unsigned long nr_init = 0, nr_free = 0; unsigned long spfn, epfn, first_init_pfn, flags; phys_addr_t spa, epa; int zid; struct zone *zone; const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id); u64 i; + unsigned long nr_node_cpus; + struct ktask_node kn; /* Bind memory initialisation thread to a local node if possible */ if (!cpumask_empty(cpumask)) @@ -1560,6 +1592,14 @@ static int __init deferred_init_memmap(void *data) return 0; } + /* + * We'd like to know the memory bandwidth of the chip to calculate the + * most efficient number of threads to start, but we can't. In + * testing, a good value for a variety of systems was a quarter of the + * CPUs on the node. + */ + nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4); + /* Sanity check boundaries */ BUG_ON(pgdat->first_deferred_pfn < pgdat->node_start_pfn); BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat)); @@ -1580,21 +1620,46 @@ static int __init deferred_init_memmap(void *data) * page in __free_one_page()). */ for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) { + struct deferred_args args = { nid, zid, ATOMIC64_INIT(0) }; + DEFINE_KTASK_CTL(ctl, deferred_init_chunk, &args, + KTASK_PTE_MINCHUNK); + ktask_ctl_set_max_threads(&ctl, nr_node_cpus); + spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa)); epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa)); - nr_pages += deferred_init_pages(nid, zid, spfn, epfn); + + kn.kn_start = (void *)spfn; + kn.kn_task_size = (spfn < epfn) ? epfn - spfn : 0; + kn.kn_nid = nid; + (void) ktask_run_numa(&kn, 1, &ctl); + + nr_init += atomic64_read(&args.nr_pages); } for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) { + struct deferred_args args = { nid, zid, ATOMIC64_INIT(0) }; + DEFINE_KTASK_CTL(ctl, deferred_free_chunk, &args, + KTASK_PTE_MINCHUNK); + ktask_ctl_set_max_threads(&ctl, nr_node_cpus); + spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa)); epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa)); - deferred_free_pages(nid, zid, spfn, epfn); + + kn.kn_start = (void *)spfn; + kn.kn_task_size = (spfn < epfn) ? 
epfn - spfn : 0; + kn.kn_nid = nid; + (void) ktask_run_numa(&kn, 1, &ctl); + + nr_free += atomic64_read(&args.nr_pages); } pgdat_resize_unlock(pgdat, &flags); /* Sanity check that the next zone really is unpopulated */ WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone)); + VM_BUG_ON(nr_init != nr_free); + + zone->managed_pages += nr_free; - pr_info("node %d initialised, %lu pages in %ums\n", nid, nr_pages, + pr_info("node %d initialised, %lu pages in %ums\n", nid, nr_free, jiffies_to_msecs(jiffies - start)); pgdat_init_report_one_done(); From patchwork Mon Nov 5 16:55:57 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Daniel Jordan X-Patchwork-Id: 10668727 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id C9A9614BD for ; Mon, 5 Nov 2018 16:59:19 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B7A0529B8D for ; Mon, 5 Nov 2018 16:59:19 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id ABC7029BA0; Mon, 5 Nov 2018 16:59:19 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4BF5929B8D for ; Mon, 5 Nov 2018 16:59:19 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387704AbeKFCR1 (ORCPT ); Mon, 5 Nov 2018 21:17:27 -0500 Received: from userp2120.oracle.com ([156.151.31.85]:43550 "EHLO userp2120.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2387677AbeKFCR0 (ORCPT ); Mon, 5 Nov 2018 21:17:26 -0500 Received: from pps.filterd (userp2120.oracle.com [127.0.0.1]) by userp2120.oracle.com (8.16.0.22/8.16.0.22) with SMTP id wA5Gs5sA052327; Mon, 5 Nov 2018 16:56:28 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=corp-2018-07-02; bh=UTB2Sr3dL/IqP1KEApaT4aDh/kChWZlvDCupI65rZ48=; b=0QHz6k1R+f97qO32+2IlnJ35k/qUdjYjcuv1TKJ4wUIXfkE6+JNJj0qrM0nghND3+w+O cAjLW1fqp98CLgSQLbB6wBLogI18brwjKrrfhq11qDMjTejr785PdYVoH1VxZ2Z/Z0Ze IydMcrlOPWbdmdONU9w42O4UZXjZbXyGe439Fz1ate0YvQ/+SB7ln/O45xJFihP1AfNJ gpG1cyhtQippKedqJxaNV46cYx78VZ6QmEvxrTeeunagc78IXJs7JRQgsruiD7iRwojV 3sUN/yabo7coxLf76IbIeC66BuznidvengcvZ1/T2h08L3xA3Af1c93X59WstgN/g6Ru bg== Received: from aserv0022.oracle.com (aserv0022.oracle.com [141.146.126.234]) by userp2120.oracle.com with ESMTP id 2nh4aqg2e6-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 05 Nov 2018 16:56:28 +0000 Received: from aserv0121.oracle.com (aserv0121.oracle.com [141.146.126.235]) by aserv0022.oracle.com (8.14.4/8.14.4) with ESMTP id wA5GuRaO022744 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 5 Nov 2018 16:56:27 GMT Received: from abhmp0006.oracle.com (abhmp0006.oracle.com [141.146.116.12]) by aserv0121.oracle.com (8.14.4/8.13.8) with ESMTP id wA5GuRg4018838; Mon, 5 Nov 2018 16:56:27 GMT Received: from localhost.localdomain (/73.60.114.248) by 
default (Oracle Beehive Gateway v4.0) with ESMTP ; Mon, 05 Nov 2018 08:56:26 -0800 From: Daniel Jordan To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz Subject: [RFC PATCH v4 12/13] mm: parallelize clear_gigantic_page Date: Mon, 5 Nov 2018 11:55:57 -0500 Message-Id: <20181105165558.11698-13-daniel.m.jordan@oracle.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com> References: <20181105165558.11698-1-daniel.m.jordan@oracle.com> MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9068 signatures=668683 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=0 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1807170000 definitions=main-1811050153 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Parallelize clear_gigantic_page, which zeroes any page size larger than 8M (e.g. 1G on x86). Performance results (the default number of threads is 4; higher thread counts shown for context only): Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory Test: Clear a range of gigantic pages (triggered via fallocate) nthread speedup size (GiB) min time (s) stdev 1 100 41.13 0.03 2 2.03x 100 20.26 0.14 4 4.28x 100 9.62 0.09 8 8.39x 100 4.90 0.05 16 10.44x 100 3.94 0.03 1 200 89.68 0.35 2 2.21x 200 40.64 0.18 4 4.64x 200 19.33 0.32 8 8.99x 200 9.98 0.04 16 11.27x 200 7.96 0.04 1 400 188.20 1.57 2 2.30x 400 81.84 0.09 4 4.63x 400 40.62 0.26 8 8.92x 400 21.09 0.50 16 11.78x 400 15.97 0.25 1 800 434.91 1.81 2 2.54x 800 170.97 1.46 4 4.98x 800 87.38 1.91 8 10.15x 800 42.86 2.59 16 12.99x 800 33.48 0.83 The speedups are mostly due to the fact that more threads can use more memory bandwidth. The loop we're stressing on the x86 chip in this test is clear_page_erms, which tops out at a bandwidth of 2550 MiB/s with one thread. We get the same bandwidth per thread for 2, 4, or 8 threads, but at 16 threads the per-thread bandwidth drops to 1420 MiB/s. However, the performance also improves over a single thread because of the ktask threads' NUMA awareness (ktask migrates worker threads to the node local to the work being done). This becomes a bigger factor as the amount of pages to zero grows to include memory from multiple nodes, so that speedups increase as the size increases. 
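As a back-of-the-envelope check on the bandwidth figures above: at the quoted 2550 MiB/s per thread, zeroing 100 GiB takes roughly

	102400 MiB / 2550 MiB/s ~= 40.2 s

which matches the measured 41.13 s single-thread minimum to within a few percent, so the single-thread case really is limited by clear_page_erms bandwidth.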
Signed-off-by: Daniel Jordan --- mm/memory.c | 32 ++++++++++++++++++++++++-------- 1 file changed, 24 insertions(+), 8 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 15c417e8e31d..445d06537905 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -69,6 +69,7 @@ #include #include #include +#include #include #include @@ -4415,19 +4416,28 @@ static inline void process_huge_page( } } -static void clear_gigantic_page(struct page *page, - unsigned long addr, - unsigned int pages_per_huge_page) +struct cgp_args { + struct page *base_page; + unsigned long addr; +}; + +static int clear_gigantic_page_chunk(unsigned long start, unsigned long end, + struct cgp_args *args) { - int i; - struct page *p = page; + struct page *base_page = args->base_page; + struct page *p = base_page; + unsigned long addr = args->addr; + unsigned long i; might_sleep(); - for (i = 0; i < pages_per_huge_page; - i++, p = mem_map_next(p, page, i)) { + for (i = start; i < end; ++i) { cond_resched(); clear_user_highpage(p, addr + i * PAGE_SIZE); + + p = mem_map_next(p, base_page, i); } + + return KTASK_RETURN_SUCCESS; } static void clear_subpage(unsigned long addr, int idx, void *arg) @@ -4444,7 +4454,13 @@ void clear_huge_page(struct page *page, ~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1); if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) { - clear_gigantic_page(page, addr, pages_per_huge_page); + struct cgp_args args = {page, addr}; + struct ktask_node node = {0, pages_per_huge_page, + page_to_nid(page)}; + DEFINE_KTASK_CTL(ctl, clear_gigantic_page_chunk, &args, + KTASK_MEM_CHUNK); + + ktask_run_numa(&node, 1, &ctl); return; } From patchwork Mon Nov 5 16:55:58 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Daniel Jordan X-Patchwork-Id: 10668703 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6F20614BD for ; Mon, 5 Nov 2018 16:58:05 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 5D51928A75 for ; Mon, 5 Nov 2018 16:58:05 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 510D629463; Mon, 5 Nov 2018 16:58:05 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id AA4C328A75 for ; Mon, 5 Nov 2018 16:58:04 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387969AbeKFCSf (ORCPT ); Mon, 5 Nov 2018 21:18:35 -0500 Received: from aserp2120.oracle.com ([141.146.126.78]:50884 "EHLO aserp2120.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2387921AbeKFCSf (ORCPT ); Mon, 5 Nov 2018 21:18:35 -0500 Received: from pps.filterd (aserp2120.oracle.com [127.0.0.1]) by aserp2120.oracle.com (8.16.0.22/8.16.0.22) with SMTP id wA5Gs16d104348; Mon, 5 Nov 2018 16:56:29 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=corp-2018-07-02; 
bh=v1Z2bulRx1p8NREQZCqWa146dHZ3uwVyRP02cw0dAu8=; b=LJey9cVLjtUWlT8pv+rIL3jPjjiTjdUnM/rJYx3ofq0/Pu9jh5LGg+L4HgYxyKWRcxwy H80cv6m8ZsTeHbIoKukaq7mRdt/LIR1ijuJ9KS1+rPxiTMvaqdHBN5UZLl0S/BeZ7xAt m+I3MdsKvofx4UGQWKwRva68Dlz3bJSnVoBBvlyTxW0hnZWhqUNV/MTNZZU6tWAIbOyg 7rYHNNCMRcJiB/4k9v2YgevLe3EmtKH4RhoqirRRf8sPktsptI6NMV4i2ed3GLIQWd2T POVsu5AkmXiCZ6JcC+IX2Kaa/OgRzq33FCm8Aci7faPGG6WJ8IVnPwsMWWo+pdPf8i9Q Xg== Received: from aserv0021.oracle.com (aserv0021.oracle.com [141.146.126.233]) by aserp2120.oracle.com with ESMTP id 2nh3mpg59a-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 05 Nov 2018 16:56:29 +0000 Received: from aserv0122.oracle.com (aserv0122.oracle.com [141.146.126.236]) by aserv0021.oracle.com (8.14.4/8.14.4) with ESMTP id wA5GuTG6025303 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 5 Nov 2018 16:56:29 GMT Received: from abhmp0006.oracle.com (abhmp0006.oracle.com [141.146.116.12]) by aserv0122.oracle.com (8.14.4/8.14.4) with ESMTP id wA5GuTbr017989; Mon, 5 Nov 2018 16:56:29 GMT Received: from localhost.localdomain (/73.60.114.248) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Mon, 05 Nov 2018 08:56:28 -0800 From: Daniel Jordan To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz Subject: [RFC PATCH v4 13/13] hugetlbfs: parallelize hugetlbfs_fallocate with ktask Date: Mon, 5 Nov 2018 11:55:58 -0500 Message-Id: <20181105165558.11698-14-daniel.m.jordan@oracle.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com> References: <20181105165558.11698-1-daniel.m.jordan@oracle.com> MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9068 signatures=668683 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=0 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1807170000 definitions=main-1811050153 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP hugetlbfs_fallocate preallocates huge pages to back a file in a hugetlbfs filesystem. The time to call this function grows linearly with size. ktask performs well with its default thread count of 4; higher thread counts are given for context only. Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory Test: fallocate(1) a file on a hugetlbfs filesystem nthread speedup size (GiB) min time (s) stdev 1 200 127.53 2.19 2 3.09x 200 41.30 2.11 4 5.72x 200 22.29 0.51 8 9.45x 200 13.50 2.58 16 9.74x 200 13.09 1.64 1 400 193.09 2.47 2 2.14x 400 90.31 3.39 4 3.84x 400 50.32 0.44 8 5.11x 400 37.75 1.23 16 6.12x 400 31.54 3.13 The primary bottleneck for better scaling at higher thread counts is hugetlb_fault_mutex_table[hash]. 
perf showed L1-dcache-loads increase with 8 threads and again sharply with 16 threads, and a CPU counter profile showed that 31% of the L1d misses were on hugetlb_fault_mutex_table[hash] in the 16-thread case. Signed-off-by: Daniel Jordan --- fs/hugetlbfs/inode.c | 114 +++++++++++++++++++++++++++++++++++-------- 1 file changed, 93 insertions(+), 21 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 762028994f47..a73548a96061 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -37,6 +37,7 @@ #include #include #include +#include #include @@ -104,11 +105,16 @@ static const struct fs_parameter_description hugetlb_fs_parameters = { }; #ifdef CONFIG_NUMA +static inline struct shared_policy *hugetlb_get_shared_policy( + struct inode *inode) +{ + return &HUGETLBFS_I(inode)->policy; +} + static inline void hugetlb_set_vma_policy(struct vm_area_struct *vma, - struct inode *inode, pgoff_t index) + struct shared_policy *policy, pgoff_t index) { - vma->vm_policy = mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy, - index); + vma->vm_policy = mpol_shared_policy_lookup(policy, index); } static inline void hugetlb_drop_vma_policy(struct vm_area_struct *vma) @@ -116,8 +122,14 @@ static inline void hugetlb_drop_vma_policy(struct vm_area_struct *vma) mpol_cond_put(vma->vm_policy); } #else +static inline struct shared_policy *hugetlb_get_shared_policy( + struct inode *inode) +{ + return NULL; +} + static inline void hugetlb_set_vma_policy(struct vm_area_struct *vma, - struct inode *inode, pgoff_t index) + struct shared_policy *policy, pgoff_t index) { } @@ -576,20 +588,30 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) return 0; } +struct hf_args { + struct file *file; + struct task_struct *parent_task; + struct mm_struct *mm; + struct shared_policy *shared_policy; + struct hstate *hstate; + struct address_space *mapping; + int error; +}; + +static int hugetlbfs_fallocate_chunk(pgoff_t start, pgoff_t end, + struct hf_args *args); + static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len) { struct inode *inode = file_inode(file); struct hugetlbfs_inode_info *info = HUGETLBFS_I(inode); - struct address_space *mapping = inode->i_mapping; struct hstate *h = hstate_inode(inode); - struct vm_area_struct pseudo_vma; - struct mm_struct *mm = current->mm; loff_t hpage_size = huge_page_size(h); unsigned long hpage_shift = huge_page_shift(h); - pgoff_t start, index, end; + pgoff_t start, end; + struct hf_args hf_args; int error; - u32 hash; if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) return -EOPNOTSUPP; @@ -617,16 +639,66 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, goto out; } + hf_args.file = file; + hf_args.parent_task = current; + hf_args.mm = current->mm; + hf_args.shared_policy = hugetlb_get_shared_policy(inode); + hf_args.hstate = h; + hf_args.mapping = inode->i_mapping; + hf_args.error = 0; + + if (unlikely(hstate_is_gigantic(h))) { + /* + * Use multiple threads in clear_gigantic_page instead of here, + * so just do a 1-threaded hugetlbfs_fallocate_chunk. 
+ */ + error = hugetlbfs_fallocate_chunk(start, end, &hf_args); + } else { + DEFINE_KTASK_CTL(ctl, hugetlbfs_fallocate_chunk, + &hf_args, KTASK_PMD_MINCHUNK); + + error = ktask_run((void *)start, end - start, &ctl); + } + + if (error != KTASK_RETURN_SUCCESS && hf_args.error != -EINTR) + goto out; + + if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) + i_size_write(inode, offset + len); + inode->i_ctime = current_time(inode); +out: + inode_unlock(inode); + return error; +} + +static int hugetlbfs_fallocate_chunk(pgoff_t start, pgoff_t end, + struct hf_args *args) +{ + struct file *file = args->file; + struct task_struct *parent_task = args->parent_task; + struct mm_struct *mm = args->mm; + struct shared_policy *shared_policy = args->shared_policy; + struct hstate *h = args->hstate; + struct address_space *mapping = args->mapping; + int error = 0; + pgoff_t index; + struct vm_area_struct pseudo_vma; + loff_t hpage_size; + u32 hash; + + hpage_size = huge_page_size(h); + /* * Initialize a pseudo vma as this is required by the huge page * allocation routines. If NUMA is configured, use page index - * as input to create an allocation policy. + * as input to create an allocation policy. Each thread gets its + * own pseudo vma because mempolicies can differ by page. */ vma_init(&pseudo_vma, mm); pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED); pseudo_vma.vm_file = file; - for (index = start; index < end; index++) { + for (index = start; index < end; ++index) { /* * This is supposed to be the vaddr where the page is being * faulted in, but we have no vaddr here. @@ -641,13 +713,13 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, * fallocate(2) manpage permits EINTR; we may have been * interrupted because we are using up too much memory. */ - if (signal_pending(current)) { + if (signal_pending(parent_task) || signal_pending(current)) { error = -EINTR; - break; + goto err; } /* Set numa allocation policy based on index */ - hugetlb_set_vma_policy(&pseudo_vma, inode, index); + hugetlb_set_vma_policy(&pseudo_vma, shared_policy, index); /* addr is the offset within the file (zero based) */ addr = index * hpage_size; @@ -672,7 +744,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, if (IS_ERR(page)) { mutex_unlock(&hugetlb_fault_mutex_table[hash]); error = PTR_ERR(page); - goto out; + goto err; } clear_huge_page(page, addr, pages_per_huge_page(h)); __SetPageUptodate(page); @@ -680,7 +752,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, if (unlikely(error)) { put_page(page); mutex_unlock(&hugetlb_fault_mutex_table[hash]); - goto out; + goto err; } mutex_unlock(&hugetlb_fault_mutex_table[hash]); @@ -693,11 +765,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, put_page(page); } - if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) - i_size_write(inode, offset + len); - inode->i_ctime = current_time(inode); -out: - inode_unlock(inode); + return KTASK_RETURN_SUCCESS; + +err: + args->error = error; + return error; }
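Taken together, every ktask client in this series follows the same calling pattern. The sketch below restates it in one place; zero_range(), zero_chunk() and struct zr_args are made-up illustrations, while DEFINE_KTASK_CTL(), ktask_ctl_set_max_threads(), ktask_run(), KTASK_MEM_CHUNK and KTASK_RETURN_SUCCESS are the calls and constants used by the patches above.

	/* Illustrative ktask client, modeled on the callers in this series. */
	struct zr_args {
		char *base;		/* state shared by all chunk threads */
	};

	/* Chunk callback: handles [start, end) of the range given to ktask_run(). */
	static int zero_chunk(unsigned long start, unsigned long end,
			      struct zr_args *args)
	{
		memset(args->base + start, 0, end - start);
		return KTASK_RETURN_SUCCESS;
	}

	static int zero_range(char *base, unsigned long len)
	{
		struct zr_args args = { base };
		/* callback, shared argument block, minimum chunk size */
		DEFINE_KTASK_CTL(ctl, zero_chunk, &args, KTASK_MEM_CHUNK);

		/* optionally raise the thread cap, as deferred struct page
		 * init does on a mostly idle boot-time system */
		ktask_ctl_set_max_threads(&ctl, 8);

		/* start, total size, control block; ktask splits the range
		 * into chunks, runs zero_chunk() on them in parallel, and
		 * returns KTASK_RETURN_SUCCESS if every chunk succeeded */
		return ktask_run((void *)0, len, &ctl);
	}

clear_gigantic_page (patch 12) uses ktask_run_numa() instead of ktask_run() so the helper threads run on the node that owns the page being cleared, which is where part of its speedup comes from.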