From patchwork Mon Nov 5 16:55:46 2018
X-Patchwork-Submitter: Daniel Jordan
X-Patchwork-Id: 10668653
From: Daniel Jordan
To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com,
darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz Subject: [RFC PATCH v4 01/13] ktask: add documentation Date: Mon, 5 Nov 2018 11:55:46 -0500 Message-Id: <20181105165558.11698-2-daniel.m.jordan@oracle.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com> References: <20181105165558.11698-1-daniel.m.jordan@oracle.com> MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9068 signatures=668683 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=0 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1807170000 definitions=main-1811050153 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: X-Virus-Scanned: ClamAV using ClamSMTP Motivates and explains the ktask API for kernel clients. Signed-off-by: Daniel Jordan --- Documentation/core-api/index.rst | 1 + Documentation/core-api/ktask.rst | 213 +++++++++++++++++++++++++++++++ 2 files changed, 214 insertions(+) create mode 100644 Documentation/core-api/ktask.rst diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index 3adee82be311..c143a280a5b1 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -18,6 +18,7 @@ Core utilities refcount-vs-atomic cpu_hotplug idr + ktask local_ops workqueue genericirq diff --git a/Documentation/core-api/ktask.rst b/Documentation/core-api/ktask.rst new file mode 100644 index 000000000000..c3c00e1f802f --- /dev/null +++ b/Documentation/core-api/ktask.rst @@ -0,0 +1,213 @@ +.. SPDX-License-Identifier: GPL-2.0+ + +============================================ +ktask: parallelize CPU-intensive kernel work +============================================ + +:Date: November, 2018 +:Author: Daniel Jordan + + +Introduction +============ + +ktask is a generic framework for parallelizing CPU-intensive work in the +kernel. The intended use is for big machines that can use their CPU power to +speed up large tasks that can't otherwise be multithreaded in userland. The +API is generic enough to add concurrency to many different kinds of tasks--for +example, page clearing over an address range or freeing a list of pages--and +aims to save its clients the trouble of splitting up the work, choosing the +number of helper threads to use, maintaining an efficient concurrency level, +starting these threads, and load balancing the work between them. + + +Motivation +========== + +A single CPU can spend an excessive amount of time in the kernel operating on +large amounts of data. Often these situations arise during initialization- and +destruction-related tasks, where the data involved scales with system size. +These long-running jobs can slow startup and shutdown of applications and the +system itself while extra CPUs sit idle. + +To ensure that applications and the kernel continue to perform well as core +counts and memory sizes increase, the kernel harnesses these idle CPUs to +complete such jobs more quickly. 
+
+For example, when booting a large NUMA machine, ktask uses additional CPUs
+that would otherwise be idle until the machine is fully up, avoiding a
+needless bottleneck during system boot and allowing the kernel to take
+advantage of unused memory bandwidth.  Similarly, when starting a large VM
+using VFIO, ktask takes advantage of the VM's idle CPUs during VFIO page
+pinning rather than having the VM's boot blocked on one thread doing all the
+work.
+
+ktask is not a substitute for single-threaded optimization.  However, there
+is a point where a single CPU hits a wall despite performance tuning, so
+parallelize!
+
+
+Concept
+=======
+
+ktask is built on unbound workqueues to take advantage of the thread
+management facilities the workqueue subsystem provides: creation,
+destruction, flushing, priority setting, and NUMA affinity.
+
+A little terminology up front:  a 'task' is the total work there is to do
+and a 'chunk' is a unit of work given to a thread.
+
+To complete a task using the ktask framework, a client provides a thread
+function that is responsible for completing one chunk.  The thread function
+is defined in a standard way, with start and end arguments that delimit the
+chunk as well as an argument that the client uses to pass data specific to
+the task.
+
+In addition, the client supplies an object representing the start of the
+task and an iterator function that knows how to advance some number of units
+in the task to yield another object representing the new task position.  The
+framework uses the start object and iterator internally to divide the task
+into chunks.
+
+Finally, the client passes the total task size and a minimum chunk size to
+indicate the minimum amount of work that's appropriate to do in one chunk.
+The sizes are given in task-specific units (e.g. pages, inodes, bytes).  The
+framework uses these sizes, along with the number of online CPUs and an
+internal maximum number of threads, to decide how many threads to start and
+how many chunks to divide the task into.
+
+For example, consider the task of clearing a gigantic page.  This used to be
+done in a single thread with a for loop that calls a page clearing function
+for each constituent base page.  To parallelize with ktask, the client first
+moves the for loop to the thread function, adapting it to operate on the
+range passed to the function.  In this simple case, the thread function's
+start and end arguments are just addresses delimiting the portion of the
+gigantic page to clear.  Then, where the for loop used to be, the client
+calls into ktask with the start address of the gigantic page, the total size
+of the gigantic page, and the thread function.  Internally, ktask will
+divide the address range into an appropriate number of chunks and start an
+appropriate number of threads to complete these chunks.
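To make the description above concrete, here is a minimal sketch of such a client. The function names are hypothetical and assume <linux/ktask.h> and <linux/string.h> are included; DEFINE_KTASK_CTL, KTASK_MEM_CHUNK, KTASK_RETURN_SUCCESS, and ktask_run are the interfaces added later in this series:

    /* Sketch only: clear one chunk of a kernel-mapped address range. */
    static int clear_range_chunk(void *start, void *end, void *arg)
    {
            memset(start, 0, (char *)end - (char *)start);
            return KTASK_RETURN_SUCCESS;
    }

    static int clear_range(void *addr, size_t size)
    {
            /* Units are bytes; one chunk is at least KTASK_MEM_CHUNK (128M). */
            DEFINE_KTASK_CTL(ctl, clear_range_chunk, NULL, KTASK_MEM_CHUNK);

            return ktask_run(addr, size, &ctl);
    }

ktask divides [addr, addr + size) into chunks and runs clear_range_chunk() on each, from the calling thread and possibly several helper threads at once.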
+
+
+Configuration
+=============
+
+To use ktask, configure the kernel with CONFIG_KTASK=y.
+
+If CONFIG_KTASK=n, calls to the ktask API are simply #define'd to run the
+thread function that the client provides so that the task is completed
+without concurrency in the current thread.
+
+
+Interface
+=========
+
+.. kernel-doc:: include/linux/ktask.h
+
+
+Resource Limits
+===============
+
+ktask has resource limits on the number of work items it sends to the
+workqueue subsystem.  In ktask, a workqueue work item corresponds to one
+thread that runs chunks of the task until the task is finished.
+
+These limits support the different ways ktask uses workqueues:
+ - ktask_run to run threads on the calling thread's node.
+ - ktask_run_numa to run threads on the node(s) specified.
+ - ktask_run_numa with nid=NUMA_NO_NODE to run threads on any node in the
+   system.
+
+To support these different ways of queueing work while maintaining an
+efficient concurrency level, we need both system-wide and per-node limits on
+the number of threads.  Without per-node limits, a node might become
+oversubscribed despite ktask staying within the system-wide limit, and
+without a system-wide limit, we can't properly account for work that can run
+on any node.
+
+The system-wide limit is based on the total number of CPUs, and the per-node
+limit on the CPU count for each node.  A per-node work item counts against
+the system-wide limit.  Workqueue's max_active can't accommodate both types
+of limit, no matter how many workqueues are used, so ktask implements its
+own.
+
+If a per-node limit is reached, the work item is allowed to run anywhere on
+the machine to avoid overwhelming the node.  If the global limit is also
+reached, ktask won't queue additional work items until we fall below the
+limit again.
+
+These limits apply only to workqueue items--that is, helper threads beyond
+the one starting the task.  That way, one thread per task is always allowed
+to run.
+
+
+Scheduler Interaction
+=====================
+
+Even within the resource limits, ktask must take care to run a number of
+threads appropriate for the system's current CPU load.  Under high CPU
+usage, starting excessive helper threads may disturb other tasks, unfairly
+taking CPU time away from them for the sake of an optimized kernel code
+path.
+
+ktask plays nicely in this case by setting helper threads to the lowest
+scheduling priority on the system (MAX_NICE).  This way, helpers' CPU time
+is appropriately throttled on a busy system and other tasks are not
+disturbed.
+
+The main thread initiating the task remains at its original priority so that
+it still makes progress on a busy system.
+
+It is possible for a helper thread to start running and then be forced
+off-CPU by a higher priority thread.  With the helper's CPU time curtailed
+by MAX_NICE, the main thread may wait longer for the task to finish than it
+would have had it not started any helpers.  To ensure forward progress at
+least at a single-threaded pace, once the main thread has finished all
+outstanding work in the task, it wills its priority to one helper thread at
+a time.  At least one thread will then always be running at the priority of
+the calling thread.
+
+
+Cgroup Awareness
+================
+
+Given the potentially large amount of CPU time ktask threads may consume,
+they should be aware of the cgroup of the task that called into ktask and
+appropriately throttled.
+
+TODO: Implement cgroup-awareness in unbound workqueues.
+
+
+Power Management
+================
+
+Starting additional helper threads may cause the system to consume more
+energy, which is undesirable on energy-conscious devices.  Therefore ktask
+needs to be aware of cpufreq policies and scaling governors.
+
+If an energy-conscious policy is in use (e.g. powersave, conservative) on
+any part of the system, that is a signal that the user has strong power
+management preferences, in which case ktask is disabled.
+
+TODO: Implement this.
+
+
+Backward Compatibility
+======================
+
+ktask is written so that existing calls to the API will be backwards
+compatible should the API gain new features in the future.
+This is accomplished by restricting API changes to members of struct
+ktask_ctl and having clients make an opaque initialization call
+(DEFINE_KTASK_CTL).  This initialization can then be modified to include any
+new arguments so that existing call sites stay the same.
+
+
+Error Handling
+==============
+
+Calls to ktask fail only if the provided thread function fails.  In
+particular, ktask avoids allocating memory internally during a task, so it's
+safe to use in sensitive contexts.
+
+Tasks can fail midway through their work.  To recover, the finished chunks
+of work need to be undone in a task-specific way, so ktask allows clients to
+pass an "undo" callback that is responsible for undoing one chunk of work.
+To avoid multiple levels of error handling, this "undo" callback should not
+be allowed to fail.  For simplicity and because it's a slow path, undoing is
+not multithreaded.
+
+Each call to ktask_run and ktask_run_numa returns a single value,
+KTASK_RETURN_SUCCESS or a client-specific value.  Since threads can fail for
+different reasons, however, ktask may need the ability to return
+thread-specific error information.  This can be added later if needed.
From patchwork Mon Nov 5 16:55:47 2018
X-Patchwork-Submitter: Daniel Jordan
X-Patchwork-Id: 10668681
From: Daniel Jordan
To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz
Subject: [RFC PATCH v4 02/13] ktask: multithread CPU-intensive kernel work
Date: Mon, 5 Nov 2018 11:55:47 -0500
Message-Id: <20181105165558.11698-3-daniel.m.jordan@oracle.com>
In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com>
References: <20181105165558.11698-1-daniel.m.jordan@oracle.com>
X-Bogosity: Ham, tests=bogofilter,
spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: X-Virus-Scanned: ClamAV using ClamSMTP A single CPU can spend an excessive amount of time in the kernel operating on large amounts of data. Often these situations arise during initialization- and destruction-related tasks, where the data involved scales with system size. These long-running jobs can slow startup and shutdown of applications and the system itself while extra CPUs sit idle. To ensure that applications and the kernel continue to perform well as core counts and memory sizes increase, harness these idle CPUs to complete such jobs more quickly. ktask is a generic framework for parallelizing CPU-intensive work in the kernel. The API is generic enough to add concurrency to many different kinds of tasks--for example, zeroing a range of pages or evicting a list of inodes--and aims to save its clients the trouble of splitting up the work, choosing the number of threads to use, maintaining an efficient concurrency level, starting these threads, and load balancing the work between them. The Documentation patch earlier in this series, from which the above was swiped, has more background. Inspired by work from Pavel Tatashin, Steve Sistare, and Jonathan Adams. Signed-off-by: Daniel Jordan Suggested-by: Pavel Tatashin Suggested-by: Steve Sistare Suggested-by: Jonathan Adams --- include/linux/ktask.h | 237 +++++++++++++++++++ init/Kconfig | 11 + init/main.c | 2 + kernel/Makefile | 2 +- kernel/ktask.c | 526 ++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 777 insertions(+), 1 deletion(-) create mode 100644 include/linux/ktask.h create mode 100644 kernel/ktask.c diff --git a/include/linux/ktask.h b/include/linux/ktask.h new file mode 100644 index 000000000000..9c75a93b51b9 --- /dev/null +++ b/include/linux/ktask.h @@ -0,0 +1,237 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * ktask.h - framework to parallelize CPU-intensive kernel work + * + * For more information, see Documentation/core-api/ktask.rst. + * + * Copyright (c) 2018 Oracle Corporation + * Author: Daniel Jordan + */ +#ifndef _LINUX_KTASK_H +#define _LINUX_KTASK_H + +#include +#include + +#define KTASK_RETURN_SUCCESS 0 + +/** + * struct ktask_node - Holds per-NUMA-node information about a task. + * + * @kn_start: An object that describes the start of the task on this NUMA node. + * @kn_task_size: size of this node's work (units are task-specific) + * @kn_nid: NUMA node id to run threads on + */ +struct ktask_node { + void *kn_start; + size_t kn_task_size; + int kn_nid; +}; + +/** + * typedef ktask_thread_func + * + * Called on each chunk of work that a ktask thread does. A thread may call + * this multiple times during one task. + * + * @start: An object that describes the start of the chunk. + * @end: An object that describes the end of the chunk. + * @arg: The thread function argument (provided with struct ktask_ctl). + * + * RETURNS: + * KTASK_RETURN_SUCCESS or a client-specific nonzero error code. + */ +typedef int (*ktask_thread_func)(void *start, void *end, void *arg); + +/** + * typedef ktask_iter_func + * + * An iterator function that advances the position by size units. + * + * @position: An object that describes the current position in the task. + * @size: The amount to advance in the task (in task-specific units). + * + * RETURNS: + * An object representing the new position. 
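+ *
+ * For a contiguous range this is simple pointer arithmetic (see
+ * ktask_iter_range below); a client iterating over, say, a list could
+ * instead walk 'size' elements and return a pointer to the new element.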
+ */ +typedef void *(*ktask_iter_func)(void *position, size_t size); + +/** + * ktask_iter_range + * + * An iterator function for a contiguous range such as an array or address + * range. This is the default iterator; clients may override with + * ktask_ctl_set_iter_func. + * + * @position: An object that describes the current position in the task. + * Interpreted as an unsigned long. + * @size: The amount to advance in the task (in task-specific units). + * + * RETURNS: + * (position + size) + */ +void *ktask_iter_range(void *position, size_t size); + +/** + * struct ktask_ctl - Client-provided per-task control information. + * + * @kc_thread_func: A thread function that completes one chunk of the task per + * call. + * @kc_func_arg: An argument to be passed to the thread and undo functions. + * @kc_iter_func: An iterator function to advance the iterator by some number + * of task-specific units. + * @kc_min_chunk_size: The minimum chunk size in task-specific units. This + * allows the client to communicate the minimum amount of + * work that's appropriate for one worker thread to do at + * once. + * @kc_max_threads: max threads to use for the task, actual number may be less + * depending on CPU count, task size, and minimum chunk size. + */ +struct ktask_ctl { + /* Required arguments set with DEFINE_KTASK_CTL. */ + ktask_thread_func kc_thread_func; + void *kc_func_arg; + size_t kc_min_chunk_size; + + /* Optional, can set with ktask_ctl_set_*. Defaults on the right. */ + ktask_iter_func kc_iter_func; /* ktask_iter_range */ + size_t kc_max_threads; /* 0 (uses internal limit) */ +}; + +#define KTASK_CTL_INITIALIZER(thread_func, func_arg, min_chunk_size) \ + { \ + .kc_thread_func = (ktask_thread_func)(thread_func), \ + .kc_func_arg = (func_arg), \ + .kc_min_chunk_size = (min_chunk_size), \ + .kc_iter_func = (ktask_iter_range), \ + .kc_max_threads = 0, \ + } + +/* + * KTASK_CTL_INITIALIZER casts 'thread_func' to be of type ktask_thread_func so + * clients can write cleaner thread functions by relieving them of the need to + * cast the three void * arguments. Clients can just use the actual argument + * types instead. + */ +#define DEFINE_KTASK_CTL(ctl_name, thread_func, func_arg, min_chunk_size) \ + struct ktask_ctl ctl_name = \ + KTASK_CTL_INITIALIZER(thread_func, func_arg, min_chunk_size) \ + +/** + * ktask_ctl_set_iter_func - Set a task-specific iterator + * + * Overrides the default iterator, ktask_iter_range. + * + * Casts the type of the iterator function so its arguments can be + * client-specific (see the comment above DEFINE_KTASK_CTL). + * + * @ctl: A control structure containing information about the task. + * @iter_func: Walks a given number of units forward in the task, returning + * an iterator corresponding to the new position. + */ +#define ktask_ctl_set_iter_func(ctl, iter_func) \ + ((ctl)->kc_iter_func = (ktask_iter_func)(iter_func)) + +/** + * ktask_ctl_set_max_threads - Set a task-specific maximum number of threads + * + * This overrides the default maximum, which is KTASK_DEFAULT_MAX_THREADS. + * + * @ctl: A control structure containing information about the task. + * @max_threads: The maximum number of threads to be started for this task. + * The actual number of threads may be less than this. + */ +static inline void ktask_ctl_set_max_threads(struct ktask_ctl *ctl, + size_t max_threads) +{ + ctl->kc_max_threads = max_threads; +} + +/* + * The minimum chunk sizes for tasks that operate on ranges of memory. For + * now, say 128M. 
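+ * (1ul << 27 bytes == 128M, so with 4K pages KTASK_PTE_MINCHUNK below works
+ * out to 32768 base pages per chunk, and with 2M PMDs KTASK_PMD_MINCHUNK to
+ * 64 huge pages per chunk.)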
+ */ +#define KTASK_MEM_CHUNK (1ul << 27) +#define KTASK_PTE_MINCHUNK (KTASK_MEM_CHUNK / PAGE_SIZE) +#define KTASK_PMD_MINCHUNK (KTASK_MEM_CHUNK / PMD_SIZE) + +#ifdef CONFIG_KTASK + +/** + * ktask_run - Runs one task. + * + * Starts threads to complete one task with the given thread function. Waits + * for the task to finish before returning. + * + * On a NUMA system, threads run on the current node. This is designed to + * mirror other parts of the kernel that favor locality, such as the default + * memory policy of allocating pages from the same node as the calling thread. + * ktask_run_numa may be used to get more control over where threads run. + * + * @start: An object that describes the start of the task. The client thread + * function interprets the object however it sees fit (e.g. an array + * index, a simple pointer, or a pointer to a more complicated + * representation of job position). + * @task_size: The size of the task (units are task-specific). + * @ctl: A control structure containing information about the task, including + * the client thread function. + * + * RETURNS: + * KTASK_RETURN_SUCCESS or a client-specific nonzero error code. + */ +int ktask_run(void *start, size_t task_size, struct ktask_ctl *ctl); + +/** + * ktask_run_numa - Runs one task while accounting for NUMA locality. + * + * Starts threads on the requested nodes to complete one task with the given + * thread function. The client is responsible for organizing the work along + * NUMA boundaries in the 'nodes' array. Waits for the task to finish before + * returning. + * + * In the special case of NUMA_NO_NODE, threads are allowed to run on any node. + * This is distinct from ktask_run, which runs threads on the current node. + * + * @nodes: An array of nodes. + * @nr_nodes: Length of the 'nodes' array. + * @ctl: Control structure containing information about the task. + * + * RETURNS: + * KTASK_RETURN_SUCCESS or a client-specific nonzero error code. + */ +int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes, + struct ktask_ctl *ctl); + +void ktask_init(void); + +#else /* CONFIG_KTASK */ + +static inline int ktask_run(void *start, size_t task_size, + struct ktask_ctl *ctl) +{ + return ctl->kc_thread_func(start, ctl->kc_iter_func(start, task_size), + ctl->kc_func_arg); +} + +static inline int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes, + struct ktask_ctl *ctl) +{ + size_t i; + int err = KTASK_RETURN_SUCCESS; + + for (i = 0; i < nr_nodes; ++i) { + void *start = nodes[i].kn_start; + void *end = ctl->kc_iter_func(start, nodes[i].kn_task_size); + + err = ctl->kc_thread_func(start, end, ctl->kc_func_arg); + if (err != KTASK_RETURN_SUCCESS) + break; + } + + return err; +} + +static inline void ktask_init(void) { } + +#endif /* CONFIG_KTASK */ +#endif /* _LINUX_KTASK_H */ diff --git a/init/Kconfig b/init/Kconfig index 41583f468cb4..ed82f76ed0b7 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -346,6 +346,17 @@ config AUDIT_TREE depends on AUDITSYSCALL select FSNOTIFY +config KTASK + bool "Multithread CPU-intensive kernel work" + depends on SMP + default y + help + Parallelize CPU-intensive kernel work. This feature is designed for + big machines that can take advantage of their extra CPUs to speed up + large kernel tasks. When enabled, kworker threads may occupy more + CPU time during these kernel tasks, but these threads are throttled + when other tasks on the system need CPU time. 
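As a quick reference for the interface added above (include/linux/ktask.h), a minimal sketch of a NUMA-aware caller might look like the following; the function names and the per-node setup are hypothetical, and only the ktask_* calls and types come from this patch:

    /* Sketch only: run one chunk of work within a single node's range. */
    static int init_node_range(void *start, void *end, void *arg)
    {
            /* ... operate on [start, end) ... */
            return KTASK_RETURN_SUCCESS;
    }

    static int init_all_nodes(struct ktask_node *nodes, size_t nr_nodes)
    {
            /* Units are bytes here, matching the default ktask_iter_range. */
            DEFINE_KTASK_CTL(ctl, init_node_range, NULL, KTASK_MEM_CHUNK);

            /* Each nodes[i] supplies kn_start, kn_task_size, and kn_nid. */
            return ktask_run_numa(nodes, nr_nodes, &ctl);
    }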
+ source "kernel/irq/Kconfig" source "kernel/time/Kconfig" source "kernel/Kconfig.preempt" diff --git a/init/main.c b/init/main.c index ee147103ba1b..c689f00eab95 100644 --- a/init/main.c +++ b/init/main.c @@ -92,6 +92,7 @@ #include #include #include +#include #include #include @@ -1145,6 +1146,7 @@ static noinline void __init kernel_init_freeable(void) smp_init(); sched_init_smp(); + ktask_init(); page_alloc_init_late(); diff --git a/kernel/Makefile b/kernel/Makefile index 63b643eb7e70..ce238cb7add5 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -10,7 +10,7 @@ obj-y = fork.o exec_domain.o panic.o \ extable.o params.o \ kthread.o sys_ni.o nsproxy.o \ notifier.o ksysfs.o cred.o reboot.o \ - async.o range.o smpboot.o ucount.o + async.o range.o smpboot.o ucount.o ktask.o obj-$(CONFIG_MODULES) += kmod.o obj-$(CONFIG_MULTIUSER) += groups.o diff --git a/kernel/ktask.c b/kernel/ktask.c new file mode 100644 index 000000000000..a7b2b5a62737 --- /dev/null +++ b/kernel/ktask.c @@ -0,0 +1,526 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* + * ktask.c - framework to parallelize CPU-intensive kernel work + * + * For more information, see Documentation/core-api/ktask.rst. + * + * Copyright (c) 2018 Oracle Corporation + * Author: Daniel Jordan + */ + +#define pr_fmt(fmt) "ktask: " fmt + +#include + +#ifdef CONFIG_KTASK + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* Resource limits on the amount of workqueue items queued through ktask. */ +static DEFINE_SPINLOCK(ktask_rlim_lock); +/* Work items queued on all nodes (includes NUMA_NO_NODE) */ +static size_t ktask_rlim_cur; +static size_t ktask_rlim_max; +/* Work items queued per node */ +static size_t *ktask_rlim_node_cur; +static size_t *ktask_rlim_node_max; + +/* Allow only 80% of the cpus to be running additional ktask threads. */ +#define KTASK_CPUFRAC_NUMER 4 +#define KTASK_CPUFRAC_DENOM 5 + +/* Used to pass ktask data to the workqueue API. */ +struct ktask_work { + struct work_struct kw_work; + struct ktask_task *kw_task; + int kw_ktask_node_i; + int kw_queue_nid; + struct list_head kw_list; /* ktask_free_works linkage */ +}; + +static LIST_HEAD(ktask_free_works); +static struct ktask_work *ktask_works; + +/* Represents one task. This is for internal use only. */ +struct ktask_task { + struct ktask_ctl kt_ctl; + size_t kt_total_size; + size_t kt_chunk_size; + /* protects this struct and struct ktask_work's of a running task */ + struct mutex kt_mutex; + struct ktask_node *kt_nodes; + size_t kt_nr_nodes; + size_t kt_nr_nodes_left; + size_t kt_nworks; + size_t kt_nworks_fini; + int kt_error; /* first error from thread_func */ + struct completion kt_ktask_done; +}; + +/* + * Shrink the size of each job by this shift amount to load balance between the + * worker threads. + */ +#define KTASK_LOAD_BAL_SHIFT 2 + +#define KTASK_DEFAULT_MAX_THREADS 4 + +/* Maximum number of threads for a single task. 
*/ +int ktask_max_threads = KTASK_DEFAULT_MAX_THREADS; + +static struct workqueue_struct *ktask_wq; +static struct workqueue_struct *ktask_nonuma_wq; + +static void ktask_thread(struct work_struct *work); + +static void ktask_init_work(struct ktask_work *kw, struct ktask_task *kt, + size_t ktask_node_i, size_t queue_nid) +{ + INIT_WORK(&kw->kw_work, ktask_thread); + kw->kw_task = kt; + kw->kw_ktask_node_i = ktask_node_i; + kw->kw_queue_nid = queue_nid; +} + +static void ktask_queue_work(struct ktask_work *kw) +{ + struct workqueue_struct *wq; + int cpu; + + if (kw->kw_queue_nid == NUMA_NO_NODE) { + /* + * If no node is specified, use ktask_nonuma_wq to + * allow the thread to run on any node, but fall back + * to ktask_wq if we couldn't allocate ktask_nonuma_wq. + */ + cpu = WORK_CPU_UNBOUND; + wq = (ktask_nonuma_wq) ?: ktask_wq; + } else { + /* + * WQ_UNBOUND workqueues, such as the one ktask uses, + * execute work on some CPU from the node of the CPU we + * pass to queue_work_on, so just pick any CPU to stand + * for the node on NUMA systems. + * + * On non-NUMA systems, cpumask_of_node becomes + * cpu_online_mask. + */ + cpu = cpumask_any(cpumask_of_node(kw->kw_queue_nid)); + wq = ktask_wq; + } + + WARN_ON(!queue_work_on(cpu, wq, &kw->kw_work)); +} + +/* Returns true if we're migrating this part of the task to another node. */ +static bool ktask_node_migrate(struct ktask_node *old_kn, struct ktask_node *kn, + size_t ktask_node_i, struct ktask_work *kw, + struct ktask_task *kt) +{ + int new_queue_nid; + + /* + * Don't migrate a user thread, otherwise migrate only if we're going + * to a different node. + */ + if (!IS_ENABLED(CONFIG_NUMA) || !(current->flags & PF_KTHREAD) || + kn->kn_nid == old_kn->kn_nid || num_online_nodes() == 1) + return false; + + /* Adjust resource limits. */ + spin_lock(&ktask_rlim_lock); + if (kw->kw_queue_nid != NUMA_NO_NODE) + --ktask_rlim_node_cur[kw->kw_queue_nid]; + + if (kn->kn_nid != NUMA_NO_NODE && + ktask_rlim_node_cur[kw->kw_queue_nid] < + ktask_rlim_node_max[kw->kw_queue_nid]) { + new_queue_nid = kn->kn_nid; + ++ktask_rlim_node_cur[new_queue_nid]; + } else { + new_queue_nid = NUMA_NO_NODE; + } + spin_unlock(&ktask_rlim_lock); + + ktask_init_work(kw, kt, ktask_node_i, new_queue_nid); + ktask_queue_work(kw); + + return true; +} + +static void ktask_thread(struct work_struct *work) +{ + struct ktask_work *kw = container_of(work, struct ktask_work, kw_work); + struct ktask_task *kt = kw->kw_task; + struct ktask_ctl *kc = &kt->kt_ctl; + struct ktask_node *kn = &kt->kt_nodes[kw->kw_ktask_node_i]; + bool done; + + mutex_lock(&kt->kt_mutex); + + while (kt->kt_total_size > 0 && kt->kt_error == KTASK_RETURN_SUCCESS) { + void *start, *end; + size_t size; + int ret; + + if (kn->kn_task_size == 0) { + /* The current node is out of work; pick a new one. */ + size_t remaining_nodes_seen = 0; + size_t new_idx = prandom_u32_max(kt->kt_nr_nodes_left); + struct ktask_node *old_kn; + size_t i; + + WARN_ON(kt->kt_nr_nodes_left == 0); + WARN_ON(new_idx >= kt->kt_nr_nodes_left); + for (i = 0; i < kt->kt_nr_nodes; ++i) { + if (kt->kt_nodes[i].kn_task_size == 0) + continue; + + if (remaining_nodes_seen >= new_idx) + break; + + ++remaining_nodes_seen; + } + /* We should have found work on another node. */ + WARN_ON(i >= kt->kt_nr_nodes); + + old_kn = kn; + kn = &kt->kt_nodes[i]; + + /* Start another worker on the node we've chosen. 
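+ * If the migration succeeds, this work item has been requeued
+ * on the new node, so drop the mutex and let the requeued
+ * instance carry on; otherwise keep working here.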
*/ + if (ktask_node_migrate(old_kn, kn, i, kw, kt)) { + mutex_unlock(&kt->kt_mutex); + return; + } + } + + start = kn->kn_start; + size = min(kt->kt_chunk_size, kn->kn_task_size); + end = kc->kc_iter_func(start, size); + kn->kn_start = end; + kn->kn_task_size -= size; + WARN_ON(kt->kt_total_size < size); + kt->kt_total_size -= size; + if (kn->kn_task_size == 0) { + WARN_ON(kt->kt_nr_nodes_left == 0); + kt->kt_nr_nodes_left--; + } + + mutex_unlock(&kt->kt_mutex); + + ret = kc->kc_thread_func(start, end, kc->kc_func_arg); + + mutex_lock(&kt->kt_mutex); + + /* Save first error code only. */ + if (kt->kt_error == KTASK_RETURN_SUCCESS && ret != kt->kt_error) + kt->kt_error = ret; + } + + WARN_ON(kt->kt_nr_nodes_left > 0 && + kt->kt_error == KTASK_RETURN_SUCCESS); + + ++kt->kt_nworks_fini; + WARN_ON(kt->kt_nworks_fini > kt->kt_nworks); + done = (kt->kt_nworks_fini == kt->kt_nworks); + mutex_unlock(&kt->kt_mutex); + + if (done) + complete(&kt->kt_ktask_done); +} + +/* + * Returns the number of chunks to break this task into. + * + * The number of chunks will be at least the number of works, but in the common + * case of a large task, the number of chunks will be greater to load balance + * between the workqueue threads in case some of them finish more quickly than + * others. + */ +static size_t ktask_chunk_size(size_t task_size, size_t min_chunk_size, + size_t nworks) +{ + size_t chunk_size; + + if (nworks == 1) + return task_size; + + chunk_size = (task_size / nworks) >> KTASK_LOAD_BAL_SHIFT; + + /* + * chunk_size should be a multiple of min_chunk_size for tasks that + * need to operate in fixed-size batches. + */ + if (chunk_size > min_chunk_size) + chunk_size = rounddown(chunk_size, min_chunk_size); + + return max(chunk_size, min_chunk_size); +} + +/* + * Returns the number of works to be used in the task. This number includes + * the current thread, so a return value of 1 means no extra threads are + * started. + */ +static size_t ktask_init_works(struct ktask_node *nodes, size_t nr_nodes, + struct ktask_task *kt, + struct list_head *works_list) +{ + size_t i, nr_works, nr_works_check; + size_t min_chunk_size = kt->kt_ctl.kc_min_chunk_size; + size_t max_threads = kt->kt_ctl.kc_max_threads; + + if (!ktask_wq) + return 1; + + if (max_threads == 0) + max_threads = ktask_max_threads; + + /* Ensure at least one thread when task_size < min_chunk_size. */ + nr_works_check = DIV_ROUND_UP(kt->kt_total_size, min_chunk_size); + nr_works_check = min_t(size_t, nr_works_check, num_online_cpus()); + nr_works_check = min_t(size_t, nr_works_check, max_threads); + + /* + * Use at least the current thread for this task; check whether + * ktask_rlim allows additional work items to be queued. + */ + nr_works = 1; + spin_lock(&ktask_rlim_lock); + for (i = nr_works; i < nr_works_check; ++i) { + /* Allocate works evenly over the task's given nodes. */ + size_t ktask_node_i = i % nr_nodes; + struct ktask_node *kn = &nodes[ktask_node_i]; + struct ktask_work *kw; + int nid = kn->kn_nid; + int queue_nid; + + WARN_ON(ktask_rlim_cur > ktask_rlim_max); + if (ktask_rlim_cur == ktask_rlim_max) + break; /* No more work items allowed to be queued. */ + + /* Allowed to queue on requested node? 
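+ * If the node is already at its per-node limit, fall back to
+ * NUMA_NO_NODE so the work can run anywhere instead of
+ * oversubscribing the requested node.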
*/ + if (nid != NUMA_NO_NODE && + ktask_rlim_node_cur[nid] < ktask_rlim_node_max[nid]) { + WARN_ON(ktask_rlim_node_cur[nid] > ktask_rlim_cur); + ++ktask_rlim_node_cur[nid]; + queue_nid = nid; + } else { + queue_nid = NUMA_NO_NODE; + } + + WARN_ON(list_empty(&ktask_free_works)); + kw = list_first_entry(&ktask_free_works, struct ktask_work, + kw_list); + list_move_tail(&kw->kw_list, works_list); + ktask_init_work(kw, kt, ktask_node_i, queue_nid); + + ++ktask_rlim_cur; + ++nr_works; + } + spin_unlock(&ktask_rlim_lock); + + return nr_works; +} + +static void ktask_fini_works(struct ktask_task *kt, + struct list_head *works_list) +{ + struct ktask_work *work; + + spin_lock(&ktask_rlim_lock); + + /* Put the works back on the free list, adjusting rlimits. */ + list_for_each_entry(work, works_list, kw_list) { + if (work->kw_queue_nid != NUMA_NO_NODE) { + WARN_ON(ktask_rlim_node_cur[work->kw_queue_nid] == 0); + --ktask_rlim_node_cur[work->kw_queue_nid]; + } + WARN_ON(ktask_rlim_cur == 0); + --ktask_rlim_cur; + } + list_splice(works_list, &ktask_free_works); + + spin_unlock(&ktask_rlim_lock); +} + +int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes, + struct ktask_ctl *ctl) +{ + size_t i; + struct ktask_work kw; + struct ktask_work *work; + LIST_HEAD(works_list); + struct ktask_task kt = { + .kt_ctl = *ctl, + .kt_total_size = 0, + .kt_nodes = nodes, + .kt_nr_nodes = nr_nodes, + .kt_nr_nodes_left = nr_nodes, + .kt_nworks_fini = 0, + .kt_error = KTASK_RETURN_SUCCESS, + }; + + for (i = 0; i < nr_nodes; ++i) { + kt.kt_total_size += nodes[i].kn_task_size; + if (nodes[i].kn_task_size == 0) + kt.kt_nr_nodes_left--; + + WARN_ON(nodes[i].kn_nid >= MAX_NUMNODES); + } + + if (kt.kt_total_size == 0) + return KTASK_RETURN_SUCCESS; + + mutex_init(&kt.kt_mutex); + init_completion(&kt.kt_ktask_done); + + kt.kt_nworks = ktask_init_works(nodes, nr_nodes, &kt, &works_list); + kt.kt_chunk_size = ktask_chunk_size(kt.kt_total_size, + ctl->kc_min_chunk_size, + kt.kt_nworks); + + list_for_each_entry(work, &works_list, kw_list) + ktask_queue_work(work); + + /* Use the current thread, which saves starting a workqueue worker. */ + ktask_init_work(&kw, &kt, 0, nodes[0].kn_nid); + ktask_thread(&kw.kw_work); + + /* Wait for all the jobs to finish. */ + wait_for_completion(&kt.kt_ktask_done); + + ktask_fini_works(&kt, &works_list); + mutex_destroy(&kt.kt_mutex); + + return kt.kt_error; +} +EXPORT_SYMBOL_GPL(ktask_run_numa); + +int ktask_run(void *start, size_t task_size, struct ktask_ctl *ctl) +{ + struct ktask_node node; + + node.kn_start = start; + node.kn_task_size = task_size; + node.kn_nid = numa_node_id(); + + return ktask_run_numa(&node, 1, ctl); +} +EXPORT_SYMBOL_GPL(ktask_run); + +/* + * Initialize internal limits on work items queued. Work items submitted to + * cmwq capped at 80% of online cpus both system-wide and per-node to maintain + * an efficient level of parallelization at these respective levels. + */ +static bool __init ktask_rlim_init(void) +{ + int node, nr_cpus; + unsigned int nr_node_cpus; + + nr_cpus = num_online_cpus(); + + /* XXX Handle CPU hotplug. 
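+ * For now the limits are computed once from the CPUs online at
+ * init time and are not recalculated when CPUs come or go.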
*/ + if (nr_cpus == 1) + return false; + + ktask_rlim_node_cur = kcalloc(num_possible_nodes(), sizeof(size_t), + GFP_KERNEL); + + ktask_rlim_node_max = kmalloc_array(num_possible_nodes(), + sizeof(size_t), GFP_KERNEL); + + ktask_rlim_max = mult_frac(nr_cpus, KTASK_CPUFRAC_NUMER, + KTASK_CPUFRAC_DENOM); + for_each_node(node) { + nr_node_cpus = cpumask_weight(cpumask_of_node(node)); + ktask_rlim_node_max[node] = mult_frac(nr_node_cpus, + KTASK_CPUFRAC_NUMER, + KTASK_CPUFRAC_DENOM); + } + + return true; +} + +void __init ktask_init(void) +{ + struct workqueue_attrs *attrs; + int i, ret; + + if (!ktask_rlim_init()) + goto out; + + ktask_works = kmalloc_array(ktask_rlim_max, sizeof(struct ktask_work), + GFP_KERNEL); + for (i = 0; i < ktask_rlim_max; ++i) + list_add_tail(&ktask_works[i].kw_list, &ktask_free_works); + + ktask_wq = alloc_workqueue("ktask_wq", WQ_UNBOUND, 0); + if (!ktask_wq) { + pr_warn("disabled (failed to alloc ktask_wq)"); + goto out; + } + + /* + * Threads executing work from this workqueue can run on any node on + * the system. If we get any failures below, use ktask_wq in its + * place. It's better than nothing. + */ + ktask_nonuma_wq = alloc_workqueue("ktask_nonuma_wq", WQ_UNBOUND, 0); + if (!ktask_nonuma_wq) { + pr_warn("disabled (failed to alloc ktask_nonuma_wq)"); + goto alloc_fail; + } + + attrs = alloc_workqueue_attrs(GFP_KERNEL); + if (!attrs) { + pr_warn("disabled (couldn't alloc wq attrs)"); + goto alloc_fail; + } + + attrs->no_numa = true; + + ret = apply_workqueue_attrs(ktask_nonuma_wq, attrs); + if (ret != 0) { + pr_warn("disabled (couldn't apply attrs to ktask_nonuma_wq)"); + goto apply_fail; + } + + free_workqueue_attrs(attrs); +out: + return; + +apply_fail: + free_workqueue_attrs(attrs); +alloc_fail: + if (ktask_wq) + destroy_workqueue(ktask_wq); + if (ktask_nonuma_wq) + destroy_workqueue(ktask_nonuma_wq); + ktask_wq = NULL; + ktask_nonuma_wq = NULL; +} + +#endif /* CONFIG_KTASK */ + +/* + * This function is defined outside CONFIG_KTASK so it can be called in the + * !CONFIG_KTASK versions of ktask_run and ktask_run_numa. 
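+ * It is also the default iterator that DEFINE_KTASK_CTL installs in
+ * struct ktask_ctl.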
+ */
+void *ktask_iter_range(void *position, size_t size)
+{
+	return (char *)position + size;
+}
+EXPORT_SYMBOL_GPL(ktask_iter_range);
From patchwork Mon Nov 5 16:55:48 2018
X-Patchwork-Submitter: Daniel Jordan
X-Patchwork-Id: 10668657
From: Daniel Jordan
To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Cc:
aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz Subject: [RFC PATCH v4 03/13] ktask: add undo support Date: Mon, 5 Nov 2018 11:55:48 -0500 Message-Id: <20181105165558.11698-4-daniel.m.jordan@oracle.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com> References: <20181105165558.11698-1-daniel.m.jordan@oracle.com> MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9068 signatures=668683 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=2 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1807170000 definitions=main-1811050153 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: X-Virus-Scanned: ClamAV using ClamSMTP Tasks can fail midway through their work. To recover, the finished chunks of work need to be undone in a task-specific way. Allow ktask clients to pass an "undo" callback that is responsible for undoing one chunk of work. To avoid multiple levels of error handling, do not allow the callback to fail. For simplicity and because it's a slow path, undoing is not multithreaded. Signed-off-by: Daniel Jordan --- include/linux/ktask.h | 36 +++++++++++- kernel/ktask.c | 125 +++++++++++++++++++++++++++++++++++------- 2 files changed, 138 insertions(+), 23 deletions(-) diff --git a/include/linux/ktask.h b/include/linux/ktask.h index 9c75a93b51b9..30a6a88e5dad 100644 --- a/include/linux/ktask.h +++ b/include/linux/ktask.h @@ -10,6 +10,7 @@ #ifndef _LINUX_KTASK_H #define _LINUX_KTASK_H +#include #include #include @@ -23,9 +24,14 @@ * @kn_nid: NUMA node id to run threads on */ struct ktask_node { - void *kn_start; - size_t kn_task_size; - int kn_nid; + void *kn_start; + size_t kn_task_size; + int kn_nid; + + /* Private fields below - do not touch these. */ + void *kn_position; + size_t kn_remaining_size; + struct list_head kn_failed_works; }; /** @@ -43,6 +49,14 @@ struct ktask_node { */ typedef int (*ktask_thread_func)(void *start, void *end, void *arg); +/** + * typedef ktask_undo_func + * + * The same as ktask_thread_func, with the exception that it must always + * succeed, so it doesn't return anything. + */ +typedef void (*ktask_undo_func)(void *start, void *end, void *arg); + /** * typedef ktask_iter_func * @@ -77,6 +91,11 @@ void *ktask_iter_range(void *position, size_t size); * * @kc_thread_func: A thread function that completes one chunk of the task per * call. + * @kc_undo_func: A function that undoes one chunk of the task per call. + * If non-NULL and error(s) occur during the task, this is + * called on all successfully completed chunks of work. The + * chunk(s) in which failure occurs should be handled in + * kc_thread_func. * @kc_func_arg: An argument to be passed to the thread and undo functions. * @kc_iter_func: An iterator function to advance the iterator by some number * of task-specific units. 
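A minimal, hypothetical client of the new undo hook might look like the
sketch below.  The my_* names are invented for illustration;
DEFINE_KTASK_CTL, ktask_run_numa() and KTASK_RETURN_SUCCESS come from
earlier patches in this series, and the DEFINE_KTASK_CTL argument order is
assumed to match KTASK_CTL_INITIALIZER above.

/* Sketch only; assumes linux/ktask.h, linux/string.h, linux/sizes.h,
 * linux/poison.h.
 */
static int my_zero_chunk(void *start, void *end, void *arg)
{
	/* One chunk of the task; may fail, first error is reported. */
	memset(start, 0, end - start);
	return KTASK_RETURN_SUCCESS;
}

/* Must not fail; called for each successfully completed chunk on error. */
static void my_undo_chunk(void *start, void *end, void *arg)
{
	memset(start, POISON_FREE, end - start);
}

static int my_parallel_zero(void *buf, size_t size, int nid)
{
	struct ktask_node node = { .kn_start = buf, .kn_task_size = size,
				   .kn_nid = nid };
	DEFINE_KTASK_CTL(ctl, my_zero_chunk, NULL, SZ_1M);

	ktask_ctl_set_undo_func(&ctl, my_undo_chunk);
	return ktask_run_numa(&node, 1, &ctl);
}

If any chunk failed, ktask would call my_undo_chunk() on each chunk that
had already completed, in position order and skipping the failed ranges,
and ktask_run_numa() would return the first error.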
@@ -90,6 +109,7 @@ void *ktask_iter_range(void *position, size_t size); struct ktask_ctl { /* Required arguments set with DEFINE_KTASK_CTL. */ ktask_thread_func kc_thread_func; + ktask_undo_func kc_undo_func; void *kc_func_arg; size_t kc_min_chunk_size; @@ -101,6 +121,7 @@ struct ktask_ctl { #define KTASK_CTL_INITIALIZER(thread_func, func_arg, min_chunk_size) \ { \ .kc_thread_func = (ktask_thread_func)(thread_func), \ + .kc_undo_func = NULL, \ .kc_func_arg = (func_arg), \ .kc_min_chunk_size = (min_chunk_size), \ .kc_iter_func = (ktask_iter_range), \ @@ -132,6 +153,15 @@ struct ktask_ctl { #define ktask_ctl_set_iter_func(ctl, iter_func) \ ((ctl)->kc_iter_func = (ktask_iter_func)(iter_func)) +/** + * ktask_ctl_set_undo_func - Designate an undo function to unwind from error + * + * @ctl: A control structure containing information about the task. + * @undo_func: Undoes a piece of the task. + */ +#define ktask_ctl_set_undo_func(ctl, undo_func) \ + ((ctl)->kc_undo_func = (ktask_undo_func)(undo_func)) + /** * ktask_ctl_set_max_threads - Set a task-specific maximum number of threads * diff --git a/kernel/ktask.c b/kernel/ktask.c index a7b2b5a62737..b91c62f14dcd 100644 --- a/kernel/ktask.c +++ b/kernel/ktask.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -46,7 +47,12 @@ struct ktask_work { struct ktask_task *kw_task; int kw_ktask_node_i; int kw_queue_nid; - struct list_head kw_list; /* ktask_free_works linkage */ + /* task units from kn_start to kw_error_start */ + size_t kw_error_offset; + void *kw_error_start; + void *kw_error_end; + /* ktask_free_works, kn_failed_works linkage */ + struct list_head kw_list; }; static LIST_HEAD(ktask_free_works); @@ -170,11 +176,11 @@ static void ktask_thread(struct work_struct *work) mutex_lock(&kt->kt_mutex); while (kt->kt_total_size > 0 && kt->kt_error == KTASK_RETURN_SUCCESS) { - void *start, *end; - size_t size; + void *position, *end; + size_t size, position_offset; int ret; - if (kn->kn_task_size == 0) { + if (kn->kn_remaining_size == 0) { /* The current node is out of work; pick a new one. */ size_t remaining_nodes_seen = 0; size_t new_idx = prandom_u32_max(kt->kt_nr_nodes_left); @@ -184,7 +190,7 @@ static void ktask_thread(struct work_struct *work) WARN_ON(kt->kt_nr_nodes_left == 0); WARN_ON(new_idx >= kt->kt_nr_nodes_left); for (i = 0; i < kt->kt_nr_nodes; ++i) { - if (kt->kt_nodes[i].kn_task_size == 0) + if (kt->kt_nodes[i].kn_remaining_size == 0) continue; if (remaining_nodes_seen >= new_idx) @@ -205,27 +211,40 @@ static void ktask_thread(struct work_struct *work) } } - start = kn->kn_start; - size = min(kt->kt_chunk_size, kn->kn_task_size); - end = kc->kc_iter_func(start, size); - kn->kn_start = end; - kn->kn_task_size -= size; + position = kn->kn_position; + position_offset = kn->kn_task_size - kn->kn_remaining_size; + size = min(kt->kt_chunk_size, kn->kn_remaining_size); + end = kc->kc_iter_func(position, size); + kn->kn_position = end; + kn->kn_remaining_size -= size; WARN_ON(kt->kt_total_size < size); kt->kt_total_size -= size; - if (kn->kn_task_size == 0) { + if (kn->kn_remaining_size == 0) { WARN_ON(kt->kt_nr_nodes_left == 0); kt->kt_nr_nodes_left--; } mutex_unlock(&kt->kt_mutex); - ret = kc->kc_thread_func(start, end, kc->kc_func_arg); + ret = kc->kc_thread_func(position, end, kc->kc_func_arg); mutex_lock(&kt->kt_mutex); - /* Save first error code only. 
*/ - if (kt->kt_error == KTASK_RETURN_SUCCESS && ret != kt->kt_error) - kt->kt_error = ret; + if (ret != KTASK_RETURN_SUCCESS) { + /* Save first error code only. */ + if (kt->kt_error == KTASK_RETURN_SUCCESS) + kt->kt_error = ret; + /* + * If this task has an undo function, save information + * about where this thread failed for ktask_undo. + */ + if (kc->kc_undo_func) { + list_move(&kw->kw_list, &kn->kn_failed_works); + kw->kw_error_start = position; + kw->kw_error_offset = position_offset; + kw->kw_error_end = end; + } + } } WARN_ON(kt->kt_nr_nodes_left > 0 && @@ -335,26 +354,85 @@ static size_t ktask_init_works(struct ktask_node *nodes, size_t nr_nodes, } static void ktask_fini_works(struct ktask_task *kt, + struct ktask_work *stack_work, struct list_head *works_list) { - struct ktask_work *work; + struct ktask_work *work, *next; spin_lock(&ktask_rlim_lock); /* Put the works back on the free list, adjusting rlimits. */ - list_for_each_entry(work, works_list, kw_list) { + list_for_each_entry_safe(work, next, works_list, kw_list) { + if (work == stack_work) { + /* On this thread's stack, so not subject to rlimits. */ + list_del(&work->kw_list); + continue; + } if (work->kw_queue_nid != NUMA_NO_NODE) { WARN_ON(ktask_rlim_node_cur[work->kw_queue_nid] == 0); --ktask_rlim_node_cur[work->kw_queue_nid]; } WARN_ON(ktask_rlim_cur == 0); --ktask_rlim_cur; + list_move(&work->kw_list, &ktask_free_works); } - list_splice(works_list, &ktask_free_works); - spin_unlock(&ktask_rlim_lock); } +static int ktask_error_cmp(void *unused, struct list_head *a, + struct list_head *b) +{ + struct ktask_work *work_a = list_entry(a, struct ktask_work, kw_list); + struct ktask_work *work_b = list_entry(b, struct ktask_work, kw_list); + + if (work_a->kw_error_offset < work_b->kw_error_offset) + return -1; + else if (work_a->kw_error_offset > work_b->kw_error_offset) + return 1; + return 0; +} + +static void ktask_undo(struct ktask_node *nodes, size_t nr_nodes, + struct ktask_ctl *ctl, struct list_head *works_list) +{ + size_t i; + + for (i = 0; i < nr_nodes; ++i) { + struct ktask_node *kn = &nodes[i]; + struct list_head *failed_works = &kn->kn_failed_works; + struct ktask_work *failed_work; + void *undo_pos = kn->kn_start; + void *undo_end; + + /* Sort so the failed ranges can be checked as we go. */ + list_sort(NULL, failed_works, ktask_error_cmp); + + /* Undo completed work on this node, skipping failed ranges. */ + while (undo_pos != kn->kn_position) { + failed_work = list_first_entry_or_null(failed_works, + struct ktask_work, + kw_list); + if (failed_work) + undo_end = failed_work->kw_error_start; + else + undo_end = kn->kn_position; + + if (undo_pos != undo_end) { + ctl->kc_undo_func(undo_pos, undo_end, + ctl->kc_func_arg); + } + + if (failed_work) { + undo_pos = failed_work->kw_error_end; + list_move(&failed_work->kw_list, works_list); + } else { + undo_pos = undo_end; + } + } + WARN_ON(!list_empty(failed_works)); + } +} + int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes, struct ktask_ctl *ctl) { @@ -374,6 +452,9 @@ int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes, for (i = 0; i < nr_nodes; ++i) { kt.kt_total_size += nodes[i].kn_task_size; + nodes[i].kn_position = nodes[i].kn_start; + nodes[i].kn_remaining_size = nodes[i].kn_task_size; + INIT_LIST_HEAD(&nodes[i].kn_failed_works); if (nodes[i].kn_task_size == 0) kt.kt_nr_nodes_left--; @@ -396,12 +477,16 @@ int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes, /* Use the current thread, which saves starting a workqueue worker. 
*/ ktask_init_work(&kw, &kt, 0, nodes[0].kn_nid); + INIT_LIST_HEAD(&kw.kw_list); ktask_thread(&kw.kw_work); /* Wait for all the jobs to finish. */ wait_for_completion(&kt.kt_ktask_done); - ktask_fini_works(&kt, &works_list); + if (kt.kt_error && ctl->kc_undo_func) + ktask_undo(nodes, nr_nodes, ctl, &works_list); + + ktask_fini_works(&kt, &kw, &works_list); mutex_destroy(&kt.kt_mutex); return kt.kt_error; From patchwork Mon Nov 5 16:55:49 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Daniel Jordan X-Patchwork-Id: 10668651 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id EB92D14BD for ; Mon, 5 Nov 2018 16:56:36 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D9CA7283A8 for ; Mon, 5 Nov 2018 16:56:36 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id CD21E293D9; Mon, 5 Nov 2018 16:56:36 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,RCVD_IN_DNSWL_NONE, UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 3A605283A8 for ; Mon, 5 Nov 2018 16:56:36 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id E88E66B0266; Mon, 5 Nov 2018 11:56:34 -0500 (EST) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id E380E6B026D; Mon, 5 Nov 2018 11:56:34 -0500 (EST) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D26E46B026E; Mon, 5 Nov 2018 11:56:34 -0500 (EST) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from mail-it1-f198.google.com (mail-it1-f198.google.com [209.85.166.198]) by kanga.kvack.org (Postfix) with ESMTP id AA6606B0266 for ; Mon, 5 Nov 2018 11:56:34 -0500 (EST) Received: by mail-it1-f198.google.com with SMTP id r79-v6so1659949itc.4 for ; Mon, 05 Nov 2018 08:56:34 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:dkim-signature:from:to:cc:subject:date :message-id:in-reply-to:references:mime-version :content-transfer-encoding; bh=gutsZD7DMWVODOwljz7ZZ5gCU9HyKt2gG0UQQPSwfmk=; b=VO93fN/cp7HBtpY5np7Yavwqdzr/6i6JtjvFMzO2VFS8R+pSfSlqzqHWH4CMmKSzyt i9F2HZVnPFB1fTjaYQUjMhFF5bSyhaLQsFM+uZC6Tf20+j1dyw9sHJAi57mJLFCVbo+s g4cJWaQtqT8cFnTFluwDthJHBPCPVs91/nWkePwyYIJ4HXOCdAQIazTiYQZxgVHfntRV oBq93b7CZcWakV0h40AQPmRmBhVeg6FZH1DF0vTeE0OSd39mVRLTSLgizLZAPYwkd0TD woBIpI9mGVouZnn/tsPMzs5jbzlhV9A6jjBCOXHDPn8q5ysNK0H5evAMSmIZZHxbyfgt amNA== X-Gm-Message-State: AGRZ1gKjAesZyHMRTOvboxB28MEqC8lLHjlZlL5bNMHPKHk0eELuxAy+ FwJ6n5DXUFIQcTh7XSILHn2aB6rRAaKZnUqDHsjwHM46hcwXNfYlUdKEDD9fubV4oD/w2GHVcgC TdULy1b8tHJDEj5ehpha6x+z7lenO642tePZIInt2wf+bhiS2RyR3W6XsU5Eq1QxxOA== X-Received: by 2002:a24:eb0b:: with SMTP id h11-v6mr7336561itj.47.1541436994408; Mon, 05 Nov 2018 08:56:34 -0800 (PST) X-Google-Smtp-Source: AJdET5eD1XJ0eZlBBOF3jyMXvc6o/NOL4nhcYhXG3Ctlmg4Da+oqdzya8rz4q5T7VB7oP6mrabEx X-Received: by 2002:a24:eb0b:: with SMTP id h11-v6mr7336509itj.47.1541436993371; Mon, 05 Nov 2018 
from localhost.localdomain (/73.60.114.248) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Mon, 05 Nov 2018 08:56:11 -0800 From: Daniel Jordan To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz Subject: [RFC PATCH v4 04/13] ktask: run helper threads at MAX_NICE Date: Mon, 5 Nov 2018 11:55:49 -0500 Message-Id: <20181105165558.11698-5-daniel.m.jordan@oracle.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com> References: <20181105165558.11698-1-daniel.m.jordan@oracle.com> MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9068 signatures=668683 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=0 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1807170000 definitions=main-1811050153 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: X-Virus-Scanned: ClamAV using ClamSMTP Multithreading may speed long-running kernel tasks, but overly optimistic parallelization can go wrong if too many helper threads are started on an already-busy system. Such helpers can degrade the performance of other tasks, so they should be sensitive to current CPU utilization[1]. To achieve this, run helpers at MAX_NICE so that their CPU time is proportional to idle CPU time. The main thread that called into ktask naturally runs at its original priority so that it can make progress on a heavily loaded system, as it would if ktask were not in the picture. I tested two different cases in which a non-ktask and a ktask workload compete for the same CPUs with the goal of showing that normal priority (i.e. nice=0) ktask helpers cause the non-ktask workload to run more slowly, whereas MAX_NICE ktask helpers don't. Testing notes: - Each case was run using 8 CPUs on a large two-socket server, with a cpumask allowing all test threads to run anywhere within the 8. - The non-ktask workload used 7 threads and the ktask workload used 8 threads to evaluate how much ktask helpers, rather than the main ktask thread, disturbed the non-ktask workload. - The non-ktask workload was started after the ktask workload and run for less time to maximize the chances that the non-ktask workload would be disturbed. - Runtimes in seconds. 
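The mechanism itself is just the workqueue 'nice' attribute; the hunk at
the end of this message applies it to ktask_wq.  As a rough sketch of the
general pattern, assuming an unbound workqueue (my_* names are
placeholders):

static int my_lower_wq_priority(struct workqueue_struct *my_wq)
{
	struct workqueue_attrs *attrs = alloc_workqueue_attrs(GFP_KERNEL);
	int ret;

	if (!attrs)
		return -ENOMEM;

	/* Lowest priority: helpers mostly run in otherwise-idle CPU time. */
	attrs->nice = MAX_NICE;
	/* Only valid for WQ_UNBOUND workqueues; attrs are copied, so free. */
	ret = apply_workqueue_attrs(my_wq, attrs);
	free_workqueue_attrs(attrs);
	return ret;
}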
Case 1: Synthetic, worst-case CPU contention

 ktask_test - a tight loop doing integer multiplication to max out on CPU;
              used for testing only, does not appear in this series
 stress-ng  - cpu stressor ("-c --cpu-method ackerman --cpu-ops 1200");

             stress-ng alone (stdev)  max_nice (stdev)  normal_prio (stdev)
             ------------------------------------------------------------
 ktask_test                           96.87 ( 1.09)     90.81 ( 0.29)
 stress-ng      43.04 ( 0.00)         43.58 ( 0.01)     75.86 ( 0.39)

This case shows MAX_NICE helpers make a significant difference compared to
normal priority helpers, with stress-ng taking 76% longer to finish when
competing with normal priority ktask threads than when run by itself, but
only 1% longer when run with MAX_NICE helpers.  The 1% comes from the small
amount of CPU time MAX_NICE threads are given despite their low priority.

Case 2: Real-world CPU contention

 ktask_vfio - VFIO page pin a 175G kvm guest
 usemem     - faults in 25G of anonymous THP per thread, PAGE_SIZE stride;
              used to mimic the page clearing that dominates in ktask_vfio
              so that usemem competes for the same system resources

             usemem alone (stdev)  max_nice (stdev)  normal_prio (stdev)
             ------------------------------------------------------------
 ktask_vfio                        14.74 ( 0.04)      9.93 ( 0.09)
 usemem         10.45 ( 0.04)      10.75 ( 0.04)     14.14 ( 0.07)

In the more realistic case 2, the effect is similar although not as
pronounced.  The usemem threads take 35% longer to finish with normal
priority ktask threads than when run alone, but only 3% longer when
MAX_NICE is used.

All ktask users outside of VFIO boil down to page clearing, so I imagine
the results would be similar for them.

[1] lkml.kernel.org/r/20171206143509.GG7515@dhcp22.suse.cz

Signed-off-by: Daniel Jordan
---
 kernel/ktask.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/ktask.c b/kernel/ktask.c
index b91c62f14dcd..72293a0f50c3 100644
--- a/kernel/ktask.c
+++ b/kernel/ktask.c
@@ -575,6 +575,18 @@ void __init ktask_init(void)
 		goto alloc_fail;
 	}
+	/*
+	 * All ktask worker threads have the lowest priority on the system so
+	 * they don't disturb other workloads.
+	 */
+	attrs->nice = MAX_NICE;
+
+	ret = apply_workqueue_attrs(ktask_wq, attrs);
+	if (ret != 0) {
+		pr_warn("disabled (couldn't apply attrs to ktask_wq)");
+		goto apply_fail;
+	}
+
 	attrs->no_numa = true;
 	ret = apply_workqueue_attrs(ktask_nonuma_wq, attrs);

From patchwork Mon Nov 5 16:55:50 2018
X-Patchwork-Submitter: Daniel Jordan
X-Patchwork-Id: 10668659
Nov 2018 08:56:13 -0800 From: Daniel Jordan To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz Subject: [RFC PATCH v4 05/13] workqueue, ktask: renice helper threads to prevent starvation Date: Mon, 5 Nov 2018 11:55:50 -0500 Message-Id: <20181105165558.11698-6-daniel.m.jordan@oracle.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com> References: <20181105165558.11698-1-daniel.m.jordan@oracle.com> MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9068 signatures=668683 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=2 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1807170000 definitions=main-1811050153 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: X-Virus-Scanned: ClamAV using ClamSMTP With ktask helper threads running at MAX_NICE, it's possible for one or more of them to begin chunks of the task and then have their CPU time constrained by higher priority threads. The main ktask thread, running at normal priority, may finish all available chunks of the task and then wait on the MAX_NICE helpers to finish the last in-progress chunks, for longer than it would have if no helpers were used. Avoid this by having the main thread assign its priority to each unfinished helper one at a time so that on a heavily loaded system, exactly one thread in a given ktask call is running at the main thread's priority. At least one thread to ensure forward progress, and at most one thread to limit excessive multithreading. Since the workqueue interface, on which ktask is built, does not provide access to worker threads, ktask can't adjust their priorities directly, so add a new interface to allow a previously-queued work item to run at a different priority than the one controlled by the corresponding workqueue's 'nice' attribute. The worker assigned to the work item will run the work at the given priority, temporarily overriding the worker's priority. The interface is flush_work_at_nice, which ensures the given work item's assigned worker runs at the specified nice level and waits for the work item to finish. An alternative choice would have been to simply requeue the work item to a pool with workers of the new priority, but this doesn't seem feasible because a worker may have already started executing the work and there's currently no way to interrupt it midway through. The proposed interface solves this issue because a worker's priority can be adjusted while it's executing the work. TODO: flush_work_at_nice is a proof-of-concept only, and it may be desired to have the interface set the work's nice without also waiting for it to finish. 
It's implemented in the flush path for this RFC because it was fairly
simple to write ;-)

I ran tests similar to the ones in the last patch with a couple of
differences:
 - The non-ktask workload uses 8 CPUs instead of 7 to compete with the
   main ktask thread as well as the ktask helpers, so that when the main
   thread finishes, its CPU is completely occupied by the non-ktask
   workload, meaning MAX_NICE helpers can't run as often.
 - The non-ktask workload starts before the ktask workload, rather than
   after, to maximize the chance that it starves helpers.

Runtimes in seconds.

Case 1: Synthetic, worst-case CPU contention

 ktask_test - a tight loop doing integer multiplication to max out on CPU;
              used for testing only, does not appear in this series
 stress-ng  - cpu stressor ("-c --cpu-method ackerman --cpu-ops 1200");

             8_ktask_thrs        8_ktask_thrs
             w/o_renice (stdev)  with_renice (stdev)  1_ktask_thr (stdev)
            ------------------------------------------------------------------
 ktask_test  41.98 ( 0.22)       25.15 ( 2.98)        30.40 ( 0.61)
 stress-ng   44.79 ( 1.11)       46.37 ( 0.69)        53.29 ( 1.91)

Without renicing, ktask_test finishes just after stress-ng does because
stress-ng needs to free up CPUs for the helpers to finish (ktask_test shows
a shorter runtime than stress-ng because ktask_test was started later).
Renicing lets ktask_test finish 40% sooner, and running the same amount of
work in ktask_test with 1 thread instead of 8 finishes in a comparable
amount of time, though longer than "with_renice" because MAX_NICE threads
still get some CPU time, and the effect over 8 threads adds up.

stress-ng's total runtime gets a little longer going from no renicing to
renicing, as expected, because each reniced ktask thread takes more CPU
time than before when the helpers were starved.

Running with one ktask thread, stress-ng's reported walltime goes up
because that single thread interferes with fewer stress-ng threads, but
with more impact, causing a greater spread in the time it takes for
individual stress-ng threads to finish.  Averages of the per-thread
stress-ng times from "with_renice" to "1_ktask_thr" come out roughly the
same, though, 43.81 and 43.89 respectively.  So the total runtime of
stress-ng across all threads is unaffected, but the time stress-ng takes
to finish running its threads completely actually improves by spreading
the ktask_test work over more threads.

Case 2: Real-world CPU contention

 ktask_vfio - VFIO page pin a 32G kvm guest
 usemem     - faults in 86G of anonymous THP per thread, PAGE_SIZE stride;
              used to mimic the page clearing that dominates in ktask_vfio
              so that usemem competes for the same system resources

             8_ktask_thrs        8_ktask_thrs
             w/o_renice (stdev)  with_renice (stdev)  1_ktask_thr (stdev)
            ------------------------------------------------------------------
 ktask_vfio  18.59 ( 0.19)       14.62 ( 2.03)        16.24 ( 0.90)
 usemem      47.54 ( 0.89)       48.18 ( 0.77)        49.70 ( 1.20)

These results are similar to case 1's, though the differences between
times are not quite as pronounced because ktask_vfio ran shorter compared
to usemem.
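As a usage sketch (hypothetical caller; flush_work_at_nice() is the
interface added below, task_nice() is the existing scheduler helper), the
waiting thread simply lends its own nice level to the work it waits on,
which is exactly how ktask_wait_for_completion() uses it later in this
patch:

/* Wait for a previously queued work item, running its assigned worker at
 * the waiter's priority in the meantime.
 */
static void my_wait_with_priority(struct work_struct *work)
{
	flush_work_at_nice(work, task_nice(current));
}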
Signed-off-by: Daniel Jordan --- include/linux/workqueue.h | 5 ++ kernel/ktask.c | 81 ++++++++++++++++++----------- kernel/workqueue.c | 106 +++++++++++++++++++++++++++++++++++--- 3 files changed, 156 insertions(+), 36 deletions(-) diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index 60d673e15632..d2976547c9c3 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -95,6 +95,10 @@ enum { WORK_BUSY_PENDING = 1 << 0, WORK_BUSY_RUNNING = 1 << 1, + /* flags for flush_work and similar functions */ + WORK_FLUSH_FROM_CANCEL = 1 << 0, + WORK_FLUSH_AT_NICE = 1 << 1, + /* maximum string length for set_worker_desc() */ WORKER_DESC_LEN = 24, }; @@ -477,6 +481,7 @@ extern int schedule_on_each_cpu(work_func_t func); int execute_in_process_context(work_func_t fn, struct execute_work *); extern bool flush_work(struct work_struct *work); +extern bool flush_work_at_nice(struct work_struct *work, long nice); extern bool cancel_work_sync(struct work_struct *work); extern bool flush_delayed_work(struct delayed_work *dwork); diff --git a/kernel/ktask.c b/kernel/ktask.c index 72293a0f50c3..9d2727ce430c 100644 --- a/kernel/ktask.c +++ b/kernel/ktask.c @@ -16,7 +16,6 @@ #include #include -#include #include #include #include @@ -24,6 +23,7 @@ #include #include #include +#include #include #include #include @@ -41,6 +41,11 @@ static size_t *ktask_rlim_node_max; #define KTASK_CPUFRAC_NUMER 4 #define KTASK_CPUFRAC_DENOM 5 +enum ktask_work_flags { + KTASK_WORK_FINISHED = 1, + KTASK_WORK_UNDO = 2, +}; + /* Used to pass ktask data to the workqueue API. */ struct ktask_work { struct work_struct kw_work; @@ -53,6 +58,7 @@ struct ktask_work { void *kw_error_end; /* ktask_free_works, kn_failed_works linkage */ struct list_head kw_list; + enum ktask_work_flags kw_flags; }; static LIST_HEAD(ktask_free_works); @@ -68,10 +74,7 @@ struct ktask_task { struct ktask_node *kt_nodes; size_t kt_nr_nodes; size_t kt_nr_nodes_left; - size_t kt_nworks; - size_t kt_nworks_fini; int kt_error; /* first error from thread_func */ - struct completion kt_ktask_done; }; /* @@ -97,6 +100,7 @@ static void ktask_init_work(struct ktask_work *kw, struct ktask_task *kt, kw->kw_task = kt; kw->kw_ktask_node_i = ktask_node_i; kw->kw_queue_nid = queue_nid; + kw->kw_flags = 0; } static void ktask_queue_work(struct ktask_work *kw) @@ -171,7 +175,6 @@ static void ktask_thread(struct work_struct *work) struct ktask_task *kt = kw->kw_task; struct ktask_ctl *kc = &kt->kt_ctl; struct ktask_node *kn = &kt->kt_nodes[kw->kw_ktask_node_i]; - bool done; mutex_lock(&kt->kt_mutex); @@ -239,6 +242,7 @@ static void ktask_thread(struct work_struct *work) * about where this thread failed for ktask_undo. 
*/ if (kc->kc_undo_func) { + kw->kw_flags |= KTASK_WORK_UNDO; list_move(&kw->kw_list, &kn->kn_failed_works); kw->kw_error_start = position; kw->kw_error_offset = position_offset; @@ -250,13 +254,8 @@ static void ktask_thread(struct work_struct *work) WARN_ON(kt->kt_nr_nodes_left > 0 && kt->kt_error == KTASK_RETURN_SUCCESS); - ++kt->kt_nworks_fini; - WARN_ON(kt->kt_nworks_fini > kt->kt_nworks); - done = (kt->kt_nworks_fini == kt->kt_nworks); + kw->kw_flags |= KTASK_WORK_FINISHED; mutex_unlock(&kt->kt_mutex); - - if (done) - complete(&kt->kt_ktask_done); } /* @@ -294,7 +293,7 @@ static size_t ktask_chunk_size(size_t task_size, size_t min_chunk_size, */ static size_t ktask_init_works(struct ktask_node *nodes, size_t nr_nodes, struct ktask_task *kt, - struct list_head *works_list) + struct list_head *unfinished_works) { size_t i, nr_works, nr_works_check; size_t min_chunk_size = kt->kt_ctl.kc_min_chunk_size; @@ -342,7 +341,7 @@ static size_t ktask_init_works(struct ktask_node *nodes, size_t nr_nodes, WARN_ON(list_empty(&ktask_free_works)); kw = list_first_entry(&ktask_free_works, struct ktask_work, kw_list); - list_move_tail(&kw->kw_list, works_list); + list_move_tail(&kw->kw_list, unfinished_works); ktask_init_work(kw, kt, ktask_node_i, queue_nid); ++ktask_rlim_cur; @@ -355,14 +354,14 @@ static size_t ktask_init_works(struct ktask_node *nodes, size_t nr_nodes, static void ktask_fini_works(struct ktask_task *kt, struct ktask_work *stack_work, - struct list_head *works_list) + struct list_head *finished_works) { struct ktask_work *work, *next; spin_lock(&ktask_rlim_lock); /* Put the works back on the free list, adjusting rlimits. */ - list_for_each_entry_safe(work, next, works_list, kw_list) { + list_for_each_entry_safe(work, next, finished_works, kw_list) { if (work == stack_work) { /* On this thread's stack, so not subject to rlimits. */ list_del(&work->kw_list); @@ -393,7 +392,7 @@ static int ktask_error_cmp(void *unused, struct list_head *a, } static void ktask_undo(struct ktask_node *nodes, size_t nr_nodes, - struct ktask_ctl *ctl, struct list_head *works_list) + struct ktask_ctl *ctl, struct list_head *finished_works) { size_t i; @@ -424,7 +423,8 @@ static void ktask_undo(struct ktask_node *nodes, size_t nr_nodes, if (failed_work) { undo_pos = failed_work->kw_error_end; - list_move(&failed_work->kw_list, works_list); + list_move(&failed_work->kw_list, + finished_works); } else { undo_pos = undo_end; } @@ -433,20 +433,46 @@ static void ktask_undo(struct ktask_node *nodes, size_t nr_nodes, } } +static void ktask_wait_for_completion(struct ktask_task *kt, + struct list_head *unfinished_works, + struct list_head *finished_works) +{ + struct ktask_work *work; + + mutex_lock(&kt->kt_mutex); + while (!list_empty(unfinished_works)) { + work = list_first_entry(unfinished_works, struct ktask_work, + kw_list); + if (!(work->kw_flags & KTASK_WORK_FINISHED)) { + mutex_unlock(&kt->kt_mutex); + flush_work_at_nice(&work->kw_work, task_nice(current)); + mutex_lock(&kt->kt_mutex); + WARN_ON_ONCE(!(work->kw_flags & KTASK_WORK_FINISHED)); + } + /* + * Leave works used in ktask_undo on kn->kn_failed_works. + * ktask_undo will move them to finished_works. 
+ */ + if (!(work->kw_flags & KTASK_WORK_UNDO)) + list_move(&work->kw_list, finished_works); + } + mutex_unlock(&kt->kt_mutex); +} + int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes, struct ktask_ctl *ctl) { - size_t i; + size_t i, nr_works; struct ktask_work kw; struct ktask_work *work; - LIST_HEAD(works_list); + LIST_HEAD(unfinished_works); + LIST_HEAD(finished_works); struct ktask_task kt = { .kt_ctl = *ctl, .kt_total_size = 0, .kt_nodes = nodes, .kt_nr_nodes = nr_nodes, .kt_nr_nodes_left = nr_nodes, - .kt_nworks_fini = 0, .kt_error = KTASK_RETURN_SUCCESS, }; @@ -465,14 +491,12 @@ int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes, return KTASK_RETURN_SUCCESS; mutex_init(&kt.kt_mutex); - init_completion(&kt.kt_ktask_done); - kt.kt_nworks = ktask_init_works(nodes, nr_nodes, &kt, &works_list); + nr_works = ktask_init_works(nodes, nr_nodes, &kt, &unfinished_works); kt.kt_chunk_size = ktask_chunk_size(kt.kt_total_size, - ctl->kc_min_chunk_size, - kt.kt_nworks); + ctl->kc_min_chunk_size, nr_works); - list_for_each_entry(work, &works_list, kw_list) + list_for_each_entry(work, &unfinished_works, kw_list) ktask_queue_work(work); /* Use the current thread, which saves starting a workqueue worker. */ @@ -480,13 +504,12 @@ int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes, INIT_LIST_HEAD(&kw.kw_list); ktask_thread(&kw.kw_work); - /* Wait for all the jobs to finish. */ - wait_for_completion(&kt.kt_ktask_done); + ktask_wait_for_completion(&kt, &unfinished_works, &finished_works); if (kt.kt_error && ctl->kc_undo_func) - ktask_undo(nodes, nr_nodes, ctl, &works_list); + ktask_undo(nodes, nr_nodes, ctl, &finished_works); - ktask_fini_works(&kt, &kw, &works_list); + ktask_fini_works(&kt, &kw, &finished_works); mutex_destroy(&kt.kt_mutex); return kt.kt_error; diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 0280deac392e..9fbae3fc9cca 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -79,6 +79,7 @@ enum { WORKER_CPU_INTENSIVE = 1 << 6, /* cpu intensive */ WORKER_UNBOUND = 1 << 7, /* worker is unbound */ WORKER_REBOUND = 1 << 8, /* worker was rebound */ + WORKER_NICED = 1 << 9, /* worker's nice was adjusted */ WORKER_NOT_RUNNING = WORKER_PREP | WORKER_CPU_INTENSIVE | WORKER_UNBOUND | WORKER_REBOUND, @@ -2184,6 +2185,18 @@ __acquires(&pool->lock) if (unlikely(cpu_intensive)) worker_clr_flags(worker, WORKER_CPU_INTENSIVE); + /* + * worker's nice level was adjusted (see flush_work_at_nice). Use the + * work's color to distinguish between the work that sets the nice + * level (== NO_COLOR) and the work for which the adjustment was made + * (!= NO_COLOR) to avoid prematurely restoring the nice level. 
+ */ + if (unlikely(worker->flags & WORKER_NICED && + work_color != WORK_NO_COLOR)) { + set_user_nice(worker->task, worker->pool->attrs->nice); + worker_clr_flags(worker, WORKER_NICED); + } + /* we're done with it, release */ hash_del(&worker->hentry); worker->current_work = NULL; @@ -2846,8 +2859,53 @@ void drain_workqueue(struct workqueue_struct *wq) } EXPORT_SYMBOL_GPL(drain_workqueue); +struct nice_work { + struct work_struct work; + long nice; +}; + +static void nice_work_func(struct work_struct *work) +{ + struct nice_work *nw = container_of(work, struct nice_work, work); + struct worker *worker = current_wq_worker(); + + if (WARN_ON_ONCE(!worker)) + return; + + set_user_nice(current, nw->nice); + worker->flags |= WORKER_NICED; +} + +/** + * insert_nice_work - insert a nice_work into a pwq + * @pwq: pwq to insert nice_work into + * @nice_work: nice_work to insert + * @target: target work to attach @nice_work to + * + * @nice_work is linked to @target such that @target starts executing only + * after @nice_work finishes execution. + * + * @nice_work's only job is to ensure @target's assigned worker runs at the + * nice level contained in @nice_work. + * + * CONTEXT: + * spin_lock_irq(pool->lock). + */ +static void insert_nice_work(struct pool_workqueue *pwq, + struct nice_work *nice_work, + struct work_struct *target) +{ + /* see comment above similar code in insert_wq_barrier */ + INIT_WORK_ONSTACK(&nice_work->work, nice_work_func); + __set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&nice_work->work)); + + debug_work_activate(&nice_work->work); + insert_work(pwq, &nice_work->work, &target->entry, + work_color_to_flags(WORK_NO_COLOR) | WORK_STRUCT_LINKED); +} + static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr, - bool from_cancel) + struct nice_work *nice_work, int flags) { struct worker *worker = NULL; struct worker_pool *pool; @@ -2868,11 +2926,19 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr, if (pwq) { if (unlikely(pwq->pool != pool)) goto already_gone; + + /* not yet started, insert linked work before work */ + if (unlikely(flags & WORK_FLUSH_AT_NICE)) + insert_nice_work(pwq, nice_work, work); } else { worker = find_worker_executing_work(pool, work); if (!worker) goto already_gone; pwq = worker->current_pwq; + if (unlikely(flags & WORK_FLUSH_AT_NICE)) { + set_user_nice(worker->task, nice_work->nice); + worker->flags |= WORKER_NICED; + } } check_flush_dependency(pwq->wq, work); @@ -2889,7 +2955,7 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr, * workqueues the deadlock happens when the rescuer stalls, blocking * forward progress. 
*/ - if (!from_cancel && + if (!(flags & WORK_FLUSH_FROM_CANCEL) && (pwq->wq->saved_max_active == 1 || pwq->wq->rescuer)) { lock_map_acquire(&pwq->wq->lockdep_map); lock_map_release(&pwq->wq->lockdep_map); @@ -2901,19 +2967,23 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr, return false; } -static bool __flush_work(struct work_struct *work, bool from_cancel) +static bool __flush_work(struct work_struct *work, int flags, long nice) { struct wq_barrier barr; + struct nice_work nice_work; if (WARN_ON(!wq_online)) return false; - if (!from_cancel) { + if (!(flags & WORK_FLUSH_FROM_CANCEL)) { lock_map_acquire(&work->lockdep_map); lock_map_release(&work->lockdep_map); } - if (start_flush_work(work, &barr, from_cancel)) { + if (unlikely(flags & WORK_FLUSH_AT_NICE)) + nice_work.nice = nice; + + if (start_flush_work(work, &barr, &nice_work, flags)) { wait_for_completion(&barr.done); destroy_work_on_stack(&barr.work); return true; @@ -2935,10 +3005,32 @@ static bool __flush_work(struct work_struct *work, bool from_cancel) */ bool flush_work(struct work_struct *work) { - return __flush_work(work, false); + return __flush_work(work, 0, 0); } EXPORT_SYMBOL_GPL(flush_work); +/** + * flush_work_at_nice - set a work's nice level and wait for it to finish + * @work: the target work + * @nice: nice level @work's assigned worker should run at + * + * Makes @work's assigned worker run at @nice for the duration of @work. + * Waits until @work has finished execution. @work is guaranteed to be idle + * on return if it hasn't been requeued since flush started. + * + * Avoids priority inversion where a high priority task queues @work on a + * workqueue with low priority workers and may wait indefinitely for @work's + * completion. That task can will its priority to @work. + * + * Return: + * %true if flush_work_at_nice() waited for the work to finish execution, + * %false if it was already idle. + */ +bool flush_work_at_nice(struct work_struct *work, long nice) +{ + return __flush_work(work, WORK_FLUSH_AT_NICE, nice); +} + struct cwt_wait { wait_queue_entry_t wait; struct work_struct *work; @@ -3001,7 +3093,7 @@ static bool __cancel_work_timer(struct work_struct *work, bool is_dwork) * isn't executing. 
 	 */
 	if (wq_online)
-		__flush_work(work, true);
+		__flush_work(work, WORK_FLUSH_FROM_CANCEL, 0);
 	clear_work_data(work);

From patchwork Mon Nov 5 16:55:51 2018
X-Patchwork-Submitter: Daniel Jordan
X-Patchwork-Id: 10668665
From: Daniel Jordan
To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Cc:
aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz Subject: [RFC PATCH v4 06/13] vfio: parallelize vfio_pin_map_dma Date: Mon, 5 Nov 2018 11:55:51 -0500 Message-Id: <20181105165558.11698-7-daniel.m.jordan@oracle.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com> References: <20181105165558.11698-1-daniel.m.jordan@oracle.com> MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9068 signatures=668683 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=0 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1807170000 definitions=main-1811050153 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: X-Virus-Scanned: ClamAV using ClamSMTP When starting a large-memory kvm guest, it takes an excessively long time to start the boot process because qemu must pin all guest pages to accommodate DMA when VFIO is in use. Currently just one CPU is responsible for the page pinning, which usually boils down to page clearing time-wise, so the ways to optimize this are buying a faster CPU ;-) or using more of the CPUs you already have. Parallelize with ktask. Refactor so workqueue workers pin with the mm of the calling thread, and to enable an undo callback for ktask to handle errors during page pinning. Performance results appear later in the series. Signed-off-by: Daniel Jordan --- drivers/vfio/vfio_iommu_type1.c | 106 +++++++++++++++++++++++--------- 1 file changed, 76 insertions(+), 30 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index d9fd3188615d..e7cfbf0c8071 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -41,6 +41,7 @@ #include #include #include +#include #define DRIVER_VERSION "0.2" #define DRIVER_AUTHOR "Alex Williamson " @@ -395,7 +396,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, */ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, long npage, unsigned long *pfn_base, - unsigned long limit) + unsigned long limit, struct mm_struct *mm) { unsigned long pfn = 0; long ret, pinned = 0, lock_acct = 0; @@ -403,10 +404,10 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, dma_addr_t iova = vaddr - dma->vaddr + dma->iova; /* This code path is only user initiated */ - if (!current->mm) + if (!mm) return -ENODEV; - ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, pfn_base); + ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base); if (ret) return ret; @@ -418,7 +419,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, * pages are already counted against the user. 
*/ if (!rsvd && !vfio_find_vpfn(dma, iova)) { - if (!dma->lock_cap && current->mm->locked_vm + 1 > limit) { + if (!dma->lock_cap && mm->locked_vm + 1 > limit) { put_pfn(*pfn_base, dma->prot); pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__, limit << PAGE_SHIFT); @@ -433,7 +434,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, /* Lock all the consecutive pages from pfn_base */ for (vaddr += PAGE_SIZE, iova += PAGE_SIZE; pinned < npage; pinned++, vaddr += PAGE_SIZE, iova += PAGE_SIZE) { - ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, &pfn); + ret = vaddr_get_pfn(mm, vaddr, dma->prot, &pfn); if (ret) break; @@ -445,7 +446,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, if (!rsvd && !vfio_find_vpfn(dma, iova)) { if (!dma->lock_cap && - current->mm->locked_vm + lock_acct + 1 > limit) { + mm->locked_vm + lock_acct + 1 > limit) { put_pfn(pfn, dma->prot); pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__, limit << PAGE_SHIFT); @@ -752,15 +753,15 @@ static size_t unmap_unpin_slow(struct vfio_domain *domain, } static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma, + dma_addr_t iova, dma_addr_t end, bool do_accounting) { - dma_addr_t iova = dma->iova, end = dma->iova + dma->size; struct vfio_domain *domain, *d; LIST_HEAD(unmapped_region_list); int unmapped_region_cnt = 0; long unlocked = 0; - if (!dma->size) + if (iova == end) return 0; if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) @@ -777,7 +778,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma, struct vfio_domain, next); list_for_each_entry_continue(d, &iommu->domain_list, next) { - iommu_unmap(d->domain, dma->iova, dma->size); + iommu_unmap(d->domain, iova, end - iova); cond_resched(); } @@ -818,8 +819,6 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma, } } - dma->iommu_mapped = false; - if (unmapped_region_cnt) unlocked += vfio_sync_unpin(dma, domain, &unmapped_region_list); @@ -830,14 +829,21 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma, return unlocked; } -static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma) +static void vfio_remove_dma_finish(struct vfio_iommu *iommu, + struct vfio_dma *dma) { - vfio_unmap_unpin(iommu, dma, true); + dma->iommu_mapped = false; vfio_unlink_dma(iommu, dma); put_task_struct(dma->task); kfree(dma); } +static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma) +{ + vfio_unmap_unpin(iommu, dma, dma->iova, dma->iova + dma->size, true); + vfio_remove_dma_finish(iommu, dma); +} + static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu) { struct vfio_domain *domain; @@ -1031,20 +1037,29 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova, return ret; } -static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma, - size_t map_size) +struct vfio_pin_args { + struct vfio_iommu *iommu; + struct vfio_dma *dma; + unsigned long limit; + struct mm_struct *mm; +}; + +static int vfio_pin_map_dma_chunk(unsigned long start_vaddr, + unsigned long end_vaddr, + struct vfio_pin_args *args) { - dma_addr_t iova = dma->iova; - unsigned long vaddr = dma->vaddr; - size_t size = map_size; + struct vfio_dma *dma = args->dma; + dma_addr_t iova = dma->iova + (start_vaddr - dma->vaddr); + unsigned long unmapped_size = end_vaddr - start_vaddr; + unsigned long pfn, mapped_size = 0; long npage; - unsigned long pfn, limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; int ret 
= 0; - while (size) { /* Pin a contiguous chunk of memory */ - npage = vfio_pin_pages_remote(dma, vaddr + dma->size, - size >> PAGE_SHIFT, &pfn, limit); + npage = vfio_pin_pages_remote(dma, start_vaddr + mapped_size, + unmapped_size >> PAGE_SHIFT, + &pfn, args->limit, args->mm); if (npage <= 0) { WARN_ON(!npage); ret = (int)npage; @@ -1052,22 +1067,50 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma, } /* Map it! */ - ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, - dma->prot); + ret = vfio_iommu_map(args->iommu, iova + mapped_size, pfn, + npage, dma->prot); if (ret) { - vfio_unpin_pages_remote(dma, iova + dma->size, pfn, + vfio_unpin_pages_remote(dma, iova + mapped_size, pfn, npage, true); break; } - size -= npage << PAGE_SHIFT; - dma->size += npage << PAGE_SHIFT; + unmapped_size -= npage << PAGE_SHIFT; + mapped_size += npage << PAGE_SHIFT; } + return (ret == 0) ? KTASK_RETURN_SUCCESS : ret; +} + +static void vfio_pin_map_dma_undo(unsigned long start_vaddr, + unsigned long end_vaddr, + struct vfio_pin_args *args) +{ + struct vfio_dma *dma = args->dma; + dma_addr_t iova = dma->iova + (start_vaddr - dma->vaddr); + dma_addr_t end = dma->iova + (end_vaddr - dma->vaddr); + + vfio_unmap_unpin(args->iommu, args->dma, iova, end, true); +} + +static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma, + size_t map_size) +{ + unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; + int ret = 0; + struct vfio_pin_args args = { iommu, dma, limit, current->mm }; + /* Stay on PMD boundary in case THP is being used. */ + DEFINE_KTASK_CTL(ctl, vfio_pin_map_dma_chunk, &args, PMD_SIZE); + + ktask_ctl_set_undo_func(&ctl, vfio_pin_map_dma_undo); + ret = ktask_run((void *)dma->vaddr, map_size, &ctl); + dma->iommu_mapped = true; if (ret) - vfio_remove_dma(iommu, dma); + vfio_remove_dma_finish(iommu, dma); + else + dma->size += map_size; return ret; } @@ -1229,7 +1272,8 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, npage = vfio_pin_pages_remote(dma, vaddr, n >> PAGE_SHIFT, - &pfn, limit); + &pfn, limit, + current->mm); if (npage <= 0) { WARN_ON(!npage); ret = (int)npage; @@ -1497,7 +1541,9 @@ static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu) long locked = 0, unlocked = 0; dma = rb_entry(n, struct vfio_dma, node); - unlocked += vfio_unmap_unpin(iommu, dma, false); + unlocked += vfio_unmap_unpin(iommu, dma, dma->iova, + dma->iova + dma->size, false); + dma->iommu_mapped = false; p = rb_first(&dma->pfn_list); for (; p; p = rb_next(p)) { struct vfio_pfn *vpfn = rb_entry(p, struct vfio_pfn,
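The ktask usage in this patch follows a fixed shape: a chunk function that processes one [start, end) range and returns KTASK_RETURN_SUCCESS or an error, an optional undo function that reverses an already-completed range when some other chunk fails, and a DEFINE_KTASK_CTL()/ktask_run() pair describing the whole range and the minimum chunk size. A minimal sketch of that shape, assuming the ktask interfaces introduced earlier in this series; the payload (struct mydev, map_range(), unmap_range(), MY_MIN_CHUNK) is hypothetical and only stands in for real per-chunk work:

/*
 * Sketch only: mirrors how vfio_pin_map_dma() above drives ktask.
 * ktask_run(), DEFINE_KTASK_CTL(), ktask_ctl_set_undo_func() and
 * KTASK_RETURN_SUCCESS come from the ktask patches earlier in this
 * series; everything else here is a made-up example payload.
 */
struct mywork_args {
	struct mydev *dev;
};

/* Process one [start, end) chunk; may run concurrently with other chunks. */
static int mywork_chunk(unsigned long start, unsigned long end,
			struct mywork_args *args)
{
	int ret = map_range(args->dev, start, end - start);

	return (ret == 0) ? KTASK_RETURN_SUCCESS : ret;
}

/* Reverse a successfully processed chunk when another chunk fails. */
static void mywork_undo(unsigned long start, unsigned long end,
			struct mywork_args *args)
{
	unmap_range(args->dev, start, end - start);
}

static int mywork_all(struct mydev *dev, unsigned long base, size_t size)
{
	struct mywork_args args = { dev };
	/* Chunks are at least MY_MIN_CHUNK bytes, like PMD_SIZE above. */
	DEFINE_KTASK_CTL(ctl, mywork_chunk, &args, MY_MIN_CHUNK);

	ktask_ctl_set_undo_func(&ctl, mywork_undo);
	return ktask_run((void *)base, size, &ctl);
}

The undo callback is what lets vfio_pin_map_dma() above call vfio_remove_dma_finish() rather than a full vfio_remove_dma() on failure: by the time ktask_run() returns an error, the chunks that did succeed are expected to have been unpinned and unmapped already.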
From patchwork Mon Nov 5 16:55:52 2018 X-Patchwork-Submitter: Daniel Jordan X-Patchwork-Id: 10668667 From: Daniel Jordan To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz Subject: [RFC PATCH v4 07/13] mm: change locked_vm's type from unsigned long to atomic_long_t Date: Mon, 5 Nov 2018 11:55:52 -0500 Message-Id: <20181105165558.11698-8-daniel.m.jordan@oracle.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com> References: <20181105165558.11698-1-daniel.m.jordan@oracle.com> MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9068 signatures=668683 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=0 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0
mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1807170000 definitions=main-1811050153 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: X-Virus-Scanned: ClamAV using ClamSMTP Currently, mmap_sem must be held as writer to modify the locked_vm field in mm_struct. This creates a bottleneck when multithreading VFIO page pinning because each thread holds the mmap_sem as reader for the majority of the pinning time but also takes mmap_sem as writer regularly, for short times, when modifying locked_vm. The problem gets worse when other workloads compete for CPU with ktask threads doing page pinning because the other workloads force ktask threads that hold mmap_sem as writer off the CPU, blocking ktask threads trying to get mmap_sem as reader for an excessively long time (the mmap_sem reader wait time grows linearly with the thread count). Requiring mmap_sem for locked_vm also abuses mmap_sem by making it protect data that could be synchronized separately. So, decouple locked_vm from mmap_sem by making locked_vm an atomic_long_t. locked_vm's old type was unsigned long and changing it to a signed type makes it lose half its capacity, but that's only a concern for 32-bit systems and LONG_MAX * PAGE_SIZE is 8T on x86 in that case, so there's headroom. Now that mmap_sem is not taken as writer here, ktask threads holding mmap_sem as reader can run more often. Performance results appear later in the series. On powerpc, this was cross-compiled-tested only. [XXX Can send separately.] Signed-off-by: Daniel Jordan --- arch/powerpc/kvm/book3s_64_vio.c | 15 ++++++++------- arch/powerpc/mm/mmu_context_iommu.c | 16 ++++++++-------- drivers/fpga/dfl-afu-dma-region.c | 16 +++++++++------- drivers/vfio/vfio_iommu_spapr_tce.c | 14 +++++++------- drivers/vfio/vfio_iommu_type1.c | 11 ++++++----- fs/proc/task_mmu.c | 2 +- include/linux/mm_types.h | 2 +- kernel/fork.c | 2 +- mm/debug.c | 3 ++- mm/mlock.c | 4 ++-- mm/mmap.c | 18 +++++++++--------- mm/mremap.c | 6 +++--- 12 files changed, 57 insertions(+), 52 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index 62a8d03ba7e9..b5637de6dde5 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -58,33 +58,34 @@ static unsigned long kvmppc_stt_pages(unsigned long tce_pages) static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc) { - long ret = 0; + long locked_vm, ret = 0; if (!current || !current->mm) return ret; /* process exited */ down_write(¤t->mm->mmap_sem); + locked_vm = atomic_long_read(¤t->mm->locked_vm); if (inc) { unsigned long locked, lock_limit; - locked = current->mm->locked_vm + stt_pages; + locked = locked_vm + stt_pages; lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; if (locked > lock_limit && !capable(CAP_IPC_LOCK)) ret = -ENOMEM; else - current->mm->locked_vm += stt_pages; + atomic_long_add(stt_pages, ¤t->mm->locked_vm); } else { - if (WARN_ON_ONCE(stt_pages > current->mm->locked_vm)) - stt_pages = current->mm->locked_vm; + if (WARN_ON_ONCE(stt_pages > locked_vm)) + stt_pages = locked_vm; - current->mm->locked_vm -= stt_pages; + atomic_long_sub(stt_pages, ¤t->mm->locked_vm); } pr_debug("[%d] RLIMIT_MEMLOCK KVM %c%ld %ld/%ld%s\n", current->pid, inc ? '+' : '-', stt_pages << PAGE_SHIFT, - current->mm->locked_vm << PAGE_SHIFT, + atomic_long_read(¤t->mm->locked_vm) << PAGE_SHIFT, rlimit(RLIMIT_MEMLOCK), ret ? 
" - exceeded" : ""); diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c index 56c2234cc6ae..a8f66975bf53 100644 --- a/arch/powerpc/mm/mmu_context_iommu.c +++ b/arch/powerpc/mm/mmu_context_iommu.c @@ -41,31 +41,31 @@ struct mm_iommu_table_group_mem_t { static long mm_iommu_adjust_locked_vm(struct mm_struct *mm, unsigned long npages, bool incr) { - long ret = 0, locked, lock_limit; + long ret = 0, locked, lock_limit, locked_vm; if (!npages) return 0; down_write(&mm->mmap_sem); - + locked_vm = atomic_long_read(&mm->locked_vm); if (incr) { - locked = mm->locked_vm + npages; + locked = locked_vm + npages; lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; if (locked > lock_limit && !capable(CAP_IPC_LOCK)) ret = -ENOMEM; else - mm->locked_vm += npages; + atomic_long_add(npages, &mm->locked_vm); } else { - if (WARN_ON_ONCE(npages > mm->locked_vm)) - npages = mm->locked_vm; - mm->locked_vm -= npages; + if (WARN_ON_ONCE(npages > locked_vm)) + npages = locked_vm; + atomic_long_sub(npages, &mm->locked_vm); } pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n", current ? current->pid : 0, incr ? '+' : '-', npages << PAGE_SHIFT, - mm->locked_vm << PAGE_SHIFT, + atomic_long_read(&mm->locked_vm) << PAGE_SHIFT, rlimit(RLIMIT_MEMLOCK)); up_write(&mm->mmap_sem); diff --git a/drivers/fpga/dfl-afu-dma-region.c b/drivers/fpga/dfl-afu-dma-region.c index 025aba3ea76c..1a7939c511a0 100644 --- a/drivers/fpga/dfl-afu-dma-region.c +++ b/drivers/fpga/dfl-afu-dma-region.c @@ -45,6 +45,7 @@ void afu_dma_region_init(struct dfl_feature_platform_data *pdata) static int afu_dma_adjust_locked_vm(struct device *dev, long npages, bool incr) { unsigned long locked, lock_limit; + long locked_vm; int ret = 0; /* the task is exiting. */ @@ -53,24 +54,25 @@ static int afu_dma_adjust_locked_vm(struct device *dev, long npages, bool incr) down_write(¤t->mm->mmap_sem); + locked_vm = atomic_long_read(¤t->mm->locked_vm); if (incr) { - locked = current->mm->locked_vm + npages; + locked = locked_vm + npages; lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; if (locked > lock_limit && !capable(CAP_IPC_LOCK)) ret = -ENOMEM; else - current->mm->locked_vm += npages; + atomic_long_add(npages, ¤t->mm->locked_vm); } else { - if (WARN_ON_ONCE(npages > current->mm->locked_vm)) - npages = current->mm->locked_vm; - current->mm->locked_vm -= npages; + if (WARN_ON_ONCE(npages > locked_vm)) + npages = locked_vm; + atomic_long_sub(npages, ¤t->mm->locked_vm); } dev_dbg(dev, "[%d] RLIMIT_MEMLOCK %c%ld %ld/%ld%s\n", current->pid, incr ? '+' : '-', npages << PAGE_SHIFT, - current->mm->locked_vm << PAGE_SHIFT, rlimit(RLIMIT_MEMLOCK), - ret ? "- exceeded" : ""); + atomic_long_read(¤t->mm->locked_vm) << PAGE_SHIFT, + rlimit(RLIMIT_MEMLOCK), ret ? 
"- exceeded" : ""); up_write(¤t->mm->mmap_sem); diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index b30926e11d87..c834bc2d1e6b 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -45,16 +45,16 @@ static long try_increment_locked_vm(struct mm_struct *mm, long npages) return 0; down_write(&mm->mmap_sem); - locked = mm->locked_vm + npages; + locked = atomic_long_read(&mm->locked_vm) + npages; lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; if (locked > lock_limit && !capable(CAP_IPC_LOCK)) ret = -ENOMEM; else - mm->locked_vm += npages; + atomic_long_add(npages, &mm->locked_vm); pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid, npages << PAGE_SHIFT, - mm->locked_vm << PAGE_SHIFT, + atomic_long_read(&mm->locked_vm) << PAGE_SHIFT, rlimit(RLIMIT_MEMLOCK), ret ? " - exceeded" : ""); @@ -69,12 +69,12 @@ static void decrement_locked_vm(struct mm_struct *mm, long npages) return; down_write(&mm->mmap_sem); - if (WARN_ON_ONCE(npages > mm->locked_vm)) - npages = mm->locked_vm; - mm->locked_vm -= npages; + if (WARN_ON_ONCE(npages > atomic_long_read(&mm->locked_vm))) + npages = atomic_long_read(&mm->locked_vm); + atomic_long_sub(npages, &mm->locked_vm); pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid, npages << PAGE_SHIFT, - mm->locked_vm << PAGE_SHIFT, + atomic_long_read(&mm->locked_vm) << PAGE_SHIFT, rlimit(RLIMIT_MEMLOCK)); up_write(&mm->mmap_sem); } diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index e7cfbf0c8071..f307dc9d5e19 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -276,13 +276,13 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async) limit = task_rlimit(dma->task, RLIMIT_MEMLOCK) >> PAGE_SHIFT; - if (mm->locked_vm + npage > limit) + if (atomic_long_read(&mm->locked_vm) + npage > limit) ret = -ENOMEM; } } if (!ret) - mm->locked_vm += npage; + atomic_long_add(npage, &mm->locked_vm); up_write(&mm->mmap_sem); } @@ -419,7 +419,8 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, * pages are already counted against the user. 
*/ if (!rsvd && !vfio_find_vpfn(dma, iova)) { - if (!dma->lock_cap && mm->locked_vm + 1 > limit) { + if (!dma->lock_cap && + atomic_long_read(&mm->locked_vm) + 1 > limit) { put_pfn(*pfn_base, dma->prot); pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__, limit << PAGE_SHIFT); @@ -445,8 +446,8 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, } if (!rsvd && !vfio_find_vpfn(dma, iova)) { - if (!dma->lock_cap && - mm->locked_vm + lock_acct + 1 > limit) { + if (!dma->lock_cap && atomic_long_read(&mm->locked_vm) + + lock_acct + 1 > limit) { put_pfn(pfn, dma->prot); pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__, limit << PAGE_SHIFT); diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 39e96a21366e..f0468bfe022e 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -58,7 +58,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) swap = get_mm_counter(mm, MM_SWAPENTS); SEQ_PUT_DEC("VmPeak:\t", hiwater_vm); SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm); - SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm); + SEQ_PUT_DEC(" kB\nVmLck:\t", atomic_long_read(&mm->locked_vm)); SEQ_PUT_DEC(" kB\nVmPin:\t", mm->pinned_vm); SEQ_PUT_DEC(" kB\nVmHWM:\t", hiwater_rss); SEQ_PUT_DEC(" kB\nVmRSS:\t", total_rss); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 5ed8f6292a53..ee02933d10e5 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -399,7 +399,7 @@ struct mm_struct { unsigned long hiwater_vm; /* High-water virtual memory usage */ unsigned long total_vm; /* Total pages mapped */ - unsigned long locked_vm; /* Pages that have PG_mlocked set */ + atomic_long_t locked_vm; /* Pages that have PG_mlocked set */ unsigned long pinned_vm; /* Refcount permanently increased */ unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */ unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE & ~VM_STACK */ diff --git a/kernel/fork.c b/kernel/fork.c index 2f78d32eaa0f..e0d4d5b8f151 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -978,7 +978,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, mm->core_state = NULL; mm_pgtables_bytes_init(mm); mm->map_count = 0; - mm->locked_vm = 0; + atomic_long_set(&mm->locked_vm, 0); mm->pinned_vm = 0; memset(&mm->rss_stat, 0, sizeof(mm->rss_stat)); spin_lock_init(&mm->page_table_lock); diff --git a/mm/debug.c b/mm/debug.c index cdacba12e09a..3d8db069176f 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -152,7 +152,8 @@ void dump_mm(const struct mm_struct *mm) atomic_read(&mm->mm_count), mm_pgtables_bytes(mm), mm->map_count, - mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm, + mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, + atomic_long_read(&mm->locked_vm), mm->pinned_vm, mm->data_vm, mm->exec_vm, mm->stack_vm, mm->start_code, mm->end_code, mm->start_data, mm->end_data, mm->start_brk, mm->brk, mm->start_stack, diff --git a/mm/mlock.c b/mm/mlock.c index 41cc47e28ad6..b213ad97585c 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -562,7 +562,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev, nr_pages = -nr_pages; else if (old_flags & VM_LOCKED) nr_pages = 0; - mm->locked_vm += nr_pages; + atomic_long_add(nr_pages, &mm->locked_vm); /* * vm_flags is protected by the mmap_sem held in write mode. 
@@ -687,7 +687,7 @@ static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t fla if (down_write_killable(¤t->mm->mmap_sem)) return -EINTR; - locked += current->mm->locked_vm; + locked += atomic_long_read(¤t->mm->locked_vm); if ((locked > lock_limit) && (!capable(CAP_IPC_LOCK))) { /* * It is possible that the regions requested intersect with diff --git a/mm/mmap.c b/mm/mmap.c index 6c04292e16a7..5c4b39b99fb4 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1339,7 +1339,7 @@ static inline int mlock_future_check(struct mm_struct *mm, /* mlock MCL_FUTURE? */ if (flags & VM_LOCKED) { locked = len >> PAGE_SHIFT; - locked += mm->locked_vm; + locked += atomic_long_read(&mm->locked_vm); lock_limit = rlimit(RLIMIT_MEMLOCK); lock_limit >>= PAGE_SHIFT; if (locked > lock_limit && !capable(CAP_IPC_LOCK)) @@ -1825,7 +1825,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, vma == get_gate_vma(current->mm)) vma->vm_flags &= VM_LOCKED_CLEAR_MASK; else - mm->locked_vm += (len >> PAGE_SHIFT); + atomic_long_add(len >> PAGE_SHIFT, &mm->locked_vm); } if (file) @@ -2291,7 +2291,7 @@ static int acct_stack_growth(struct vm_area_struct *vma, if (vma->vm_flags & VM_LOCKED) { unsigned long locked; unsigned long limit; - locked = mm->locked_vm + grow; + locked = atomic_long_read(&mm->locked_vm) + grow; limit = rlimit(RLIMIT_MEMLOCK); limit >>= PAGE_SHIFT; if (locked > limit && !capable(CAP_IPC_LOCK)) @@ -2385,7 +2385,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address) */ spin_lock(&mm->page_table_lock); if (vma->vm_flags & VM_LOCKED) - mm->locked_vm += grow; + atomic_long_add(grow, &mm->locked_vm); vm_stat_account(mm, vma->vm_flags, grow); anon_vma_interval_tree_pre_update_vma(vma); vma->vm_end = address; @@ -2466,7 +2466,7 @@ int expand_downwards(struct vm_area_struct *vma, */ spin_lock(&mm->page_table_lock); if (vma->vm_flags & VM_LOCKED) - mm->locked_vm += grow; + atomic_long_add(grow, &mm->locked_vm); vm_stat_account(mm, vma->vm_flags, grow); anon_vma_interval_tree_pre_update_vma(vma); vma->vm_start = address; @@ -2787,11 +2787,11 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len, /* * unlock any mlock()ed ranges before detaching vmas */ - if (mm->locked_vm) { + if (atomic_long_read(&mm->locked_vm)) { struct vm_area_struct *tmp = vma; while (tmp && tmp->vm_start < end) { if (tmp->vm_flags & VM_LOCKED) { - mm->locked_vm -= vma_pages(tmp); + atomic_long_sub(vma_pages(tmp), &mm->locked_vm); munlock_vma_pages_all(tmp); } @@ -3050,7 +3050,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla mm->total_vm += len >> PAGE_SHIFT; mm->data_vm += len >> PAGE_SHIFT; if (flags & VM_LOCKED) - mm->locked_vm += (len >> PAGE_SHIFT); + atomic_long_add(len >> PAGE_SHIFT, &mm->locked_vm); vma->vm_flags |= VM_SOFTDIRTY; return 0; } @@ -3122,7 +3122,7 @@ void exit_mmap(struct mm_struct *mm) up_write(&mm->mmap_sem); } - if (mm->locked_vm) { + if (atomic_long_read(&mm->locked_vm)) { vma = mm->mmap; while (vma) { if (vma->vm_flags & VM_LOCKED) diff --git a/mm/mremap.c b/mm/mremap.c index 7f9f9180e401..abcc58690f17 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -360,7 +360,7 @@ static unsigned long move_vma(struct vm_area_struct *vma, } if (vm_flags & VM_LOCKED) { - mm->locked_vm += new_len >> PAGE_SHIFT; + atomic_long_add(new_len >> PAGE_SHIFT, &mm->locked_vm); *locked = true; } @@ -411,7 +411,7 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr, if (vma->vm_flags & VM_LOCKED) { unsigned long locked, lock_limit; - 
locked = mm->locked_vm << PAGE_SHIFT; + locked = atomic_long_read(&mm->locked_vm) << PAGE_SHIFT; lock_limit = rlimit(RLIMIT_MEMLOCK); locked += new_len - old_len; if (locked > lock_limit && !capable(CAP_IPC_LOCK)) @@ -600,7 +600,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len, vm_stat_account(mm, vma->vm_flags, pages); if (vma->vm_flags & VM_LOCKED) { - mm->locked_vm += pages; + atomic_long_add(pages, &mm->locked_vm); locked = true; new_addr = addr; }
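The conversion just shown repeats one pattern at every call site: take a snapshot with atomic_long_read(), do the RLIMIT_MEMLOCK check against the snapshot, then adjust the counter with atomic_long_add()/atomic_long_sub(). A condensed before/after sketch of that pattern, using a hypothetical helper rather than any one of the files above (the two versions obviously cannot coexist, since the type of locked_vm changes between them):

/* Before this patch: locked_vm is an unsigned long, written under mmap_sem. */
static int charge_locked_vm_old(struct mm_struct *mm, unsigned long npages)
{
	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
	int ret = 0;

	down_write(&mm->mmap_sem);
	if (mm->locked_vm + npages > limit && !capable(CAP_IPC_LOCK))
		ret = -ENOMEM;
	else
		mm->locked_vm += npages;
	up_write(&mm->mmap_sem);

	return ret;
}

/*
 * After this patch: locked_vm is an atomic_long_t.  mmap_sem is still taken
 * here; callers that no longer need it for anything else can drop it later,
 * as the next patch does for VFIO.
 */
static int charge_locked_vm_new(struct mm_struct *mm, unsigned long npages)
{
	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
	long locked_vm;
	int ret = 0;

	down_write(&mm->mmap_sem);
	locked_vm = atomic_long_read(&mm->locked_vm);
	if (locked_vm + npages > limit && !capable(CAP_IPC_LOCK))
		ret = -ENOMEM;
	else
		atomic_long_add(npages, &mm->locked_vm);
	up_write(&mm->mmap_sem);

	return ret;
}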
From patchwork Mon Nov 5 16:55:53 2018 X-Patchwork-Submitter: Daniel Jordan X-Patchwork-Id: 10668677 From: Daniel Jordan To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz Subject: [RFC PATCH v4 08/13] vfio: remove unnecessary mmap_sem writer acquisition around locked_vm Date: Mon, 5 Nov 2018 11:55:53 -0500 Message-Id: <20181105165558.11698-9-daniel.m.jordan@oracle.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com> References: <20181105165558.11698-1-daniel.m.jordan@oracle.com> MIME-Version: 1.0 Now that mmap_sem is no longer required for modifying locked_vm, remove it in the VFIO code. [XXX Can be sent separately, along with similar conversions in the other places mmap_sem was taken for locked_vm. While at it, could make similar changes to pinned_vm.]
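The resulting accounting scheme is: add to locked_vm speculatively with atomic_long_add_return(), then back the addition out if it pushed the count over the limit. Roughly, as a sketch with a hypothetical helper (the actual change is confined to vfio_lock_acct() in the diff below):

/*
 * Sketch of lock-free locked_vm accounting: charge first, then undo the
 * charge if it exceeded RLIMIT_MEMLOCK.  charge_locked_vm() is a
 * hypothetical helper, not an existing kernel function.
 */
static int charge_locked_vm(struct mm_struct *mm, long npages,
			    unsigned long limit, bool lock_cap)
{
	long locked_vm = atomic_long_add_return(npages, &mm->locked_vm);

	if (npages > 0 && !lock_cap && locked_vm > limit) {
		/* Over the limit: back out the speculative add. */
		atomic_long_sub(npages, &mm->locked_vm);
		return -ENOMEM;
	}

	return 0;
}

Between the add and the backout, other readers can briefly observe locked_vm above the limit; that transient overshoot is the trade-off for doing the accounting without mmap_sem.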
Signed-off-by: Daniel Jordan --- drivers/vfio/vfio_iommu_type1.c | 26 +++++++++----------------- 1 file changed, 9 insertions(+), 17 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index f307dc9d5e19..9e52a24eb45b 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -258,7 +258,8 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn) static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async) { struct mm_struct *mm; - int ret; + long locked_vm; + int ret = 0; if (!npage) return 0; @@ -267,24 +268,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async) if (!mm) return -ESRCH; /* process exited */ - ret = down_write_killable(&mm->mmap_sem); - if (!ret) { - if (npage > 0) { - if (!dma->lock_cap) { - unsigned long limit; - - limit = task_rlimit(dma->task, - RLIMIT_MEMLOCK) >> PAGE_SHIFT; + locked_vm = atomic_long_add_return(npage, &mm->locked_vm); - if (atomic_long_read(&mm->locked_vm) + npage > limit) - ret = -ENOMEM; - } + if (npage > 0 && !dma->lock_cap) { + unsigned long limit = task_rlimit(dma->task, RLIMIT_MEMLOCK) >> + PAGE_SHIFT; + if (locked_vm > limit) { + atomic_long_sub(npage, &mm->locked_vm); + ret = -ENOMEM; } - - if (!ret) - atomic_long_add(npage, &mm->locked_vm); - - up_write(&mm->mmap_sem); } if (async) From patchwork Mon Nov 5 16:55:54 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Daniel Jordan X-Patchwork-Id: 10668685 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 89D9614BD for ; Mon, 5 Nov 2018 16:57:28 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 77A872946D for ; Mon, 5 Nov 2018 16:57:28 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 6AB6329463; Mon, 5 Nov 2018 16:57:28 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,RCVD_IN_DNSWL_NONE, UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id A645628A75 for ; Mon, 5 Nov 2018 16:57:27 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id E32286B0279; Mon, 5 Nov 2018 11:56:45 -0500 (EST) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id D76926B027D; Mon, 5 Nov 2018 11:56:45 -0500 (EST) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A3E2A6B027B; Mon, 5 Nov 2018 11:56:45 -0500 (EST) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from mail-yb1-f199.google.com (mail-yb1-f199.google.com [209.85.219.199]) by kanga.kvack.org (Postfix) with ESMTP id 771EC6B027A for ; Mon, 5 Nov 2018 11:56:45 -0500 (EST) Received: by mail-yb1-f199.google.com with SMTP id i13-v6so7875955ybe.14 for ; Mon, 05 Nov 2018 08:56:45 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:dkim-signature:from:to:cc:subject:date :message-id:in-reply-to:references:mime-version 
From: Daniel Jordan To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz Subject: [RFC PATCH v4 09/13] vfio: relieve mmap_sem reader cacheline bouncing by holding it longer Date: Mon, 5 Nov 2018 11:55:54 -0500 Message-Id: <20181105165558.11698-10-daniel.m.jordan@oracle.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com> References: <20181105165558.11698-1-daniel.m.jordan@oracle.com> MIME-Version: 1.0
Profiling shows significant time being spent on atomic ops in mmap_sem reader acquisition. mmap_sem is taken and dropped for every single base page during pinning, so this is not surprising. Reduce the number of times mmap_sem is taken by holding it for longer, which relieves atomic cacheline bouncing.

Results for all VFIO page pinning patches
-----------------------------------------

The test measures the time from qemu invocation to the start of guest boot. The guest uses kvm with 320G memory backed with THP. 320G fits in a node on the test machine used here, so there was no thrashing in reclaim because of __GFP_THISNODE in THP allocations[1].

CPU: 2 nodes * 24 cores/node * 2 threads/core = 96 CPUs
     Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
memory: 754G split evenly between nodes
scaling_governor: performance

           patch 6                   patch 8                patch 9 (this one)
     -----------------------  ------------------------  ------------------------
thr  speedup   average sec    speedup   average sec     speedup   average sec
  1           65.0 (± 0.6%)             65.2 (± 0.5%)             65.5 (± 0.4%)
  2   1.5x    42.8 (± 5.8%)    1.8x     36.0 (± 0.9%)    1.9x     34.4 (± 0.3%)
  3   1.9x    35.0 (±11.3%)    2.5x     26.4 (± 4.2%)    2.8x     23.7 (± 0.2%)
  4   2.3x    28.5 (± 1.3%)    3.1x     21.2 (± 2.8%)    3.6x     18.3 (± 0.3%)
  5   2.5x    26.2 (± 1.5%)    3.6x     17.9 (± 0.9%)    4.3x     15.1 (± 0.3%)
  6   2.7x    24.5 (± 1.8%)    4.0x     16.5 (± 3.0%)    5.1x     12.9 (± 0.1%)
  7   2.8x    23.5 (± 4.9%)    4.2x     15.4 (± 2.7%)    5.7x     11.5 (± 0.6%)
  8   2.8x    22.8 (± 1.8%)    4.2x     15.5 (± 4.7%)    6.4x     10.3 (± 0.8%)
 12   3.2x    20.2 (± 1.4%)    4.4x     14.7 (± 2.9%)    8.6x      7.6 (± 0.6%)
 16   3.3x    20.0 (± 0.7%)    4.3x     15.4 (± 1.3%)   10.2x      6.4 (± 0.6%)

At patch 6, lock_stat showed long reader wait time on mmap_sem writers, leading to patch 8. At patch 8, profiling revealed the issue with mmap_sem described above. Across all three patches, performance consistently improves as the thread count increases. The one exception is the antiscaling with nthr=16 in patch 8: those mmap_sem atomics are really bouncing around the machine. The performance with patch 9 looks pretty good overall. I'm working on finding the next bottleneck, and this is where it stopped:

When nthr=16, the obvious issue profiling showed was contention on the split PMD page table lock when pages are faulted in during the pinning (>2% of the time). A split PMD lock protects a PUD_SIZE-ed amount of page table mappings (1G on x86), so if threads were operating on smaller chunks and contending in the same PUD_SIZE range, this could be the source of contention.
However, when nthr=16, threads operate on 5G chunks (320G / 16 threads / (1< --- drivers/vfio/vfio_iommu_type1.c | 28 +++++++++++++++++++--------- 1 file changed, 19 insertions(+), 9 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 9e52a24eb45b..0d6ec0369de5 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -331,7 +331,7 @@ static int put_pfn(unsigned long pfn, int prot) } static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, - int prot, unsigned long *pfn) + int prot, unsigned long *pfn, bool handle_mmap_sem) { struct page *page[1]; struct vm_area_struct *vma; @@ -342,7 +342,8 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, if (prot & IOMMU_WRITE) flags |= FOLL_WRITE; - down_read(&mm->mmap_sem); + if (handle_mmap_sem) + down_read(&mm->mmap_sem); if (mm == current->mm) { ret = get_user_pages_longterm(vaddr, 1, flags, page, vmas); } else { @@ -360,14 +361,16 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, put_page(page[0]); } } - up_read(&mm->mmap_sem); + if (handle_mmap_sem) + up_read(&mm->mmap_sem); if (ret == 1) { *pfn = page_to_pfn(page[0]); return 0; } - down_read(&mm->mmap_sem); + if (handle_mmap_sem) + down_read(&mm->mmap_sem); vma = find_vma_intersection(mm, vaddr, vaddr + 1); @@ -377,7 +380,8 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, ret = 0; } - up_read(&mm->mmap_sem); + if (handle_mmap_sem) + up_read(&mm->mmap_sem); return ret; } @@ -399,9 +403,12 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, if (!mm) return -ENODEV; - ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base); - if (ret) + down_read(&mm->mmap_sem); + ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base, false); + if (ret) { + up_read(&mm->mmap_sem); return ret; + } pinned++; rsvd = is_invalid_reserved_pfn(*pfn_base); @@ -416,6 +423,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, put_pfn(*pfn_base, dma->prot); pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__, limit << PAGE_SHIFT); + up_read(&mm->mmap_sem); return -ENOMEM; } lock_acct++; @@ -427,7 +435,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, /* Lock all the consecutive pages from pfn_base */ for (vaddr += PAGE_SIZE, iova += PAGE_SIZE; pinned < npage; pinned++, vaddr += PAGE_SIZE, iova += PAGE_SIZE) { - ret = vaddr_get_pfn(mm, vaddr, dma->prot, &pfn); + ret = vaddr_get_pfn(mm, vaddr, dma->prot, &pfn, false); if (ret) break; @@ -444,6 +452,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__, limit << PAGE_SHIFT); ret = -ENOMEM; + up_read(&mm->mmap_sem); goto unpin_out; } lock_acct++; @@ -451,6 +460,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, } out: + up_read(&mm->mmap_sem); ret = vfio_lock_acct(dma, lock_acct, false); unpin_out: @@ -497,7 +507,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr, if (!mm) return -ENODEV; - ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base); + ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base, true); if (!ret && do_accounting && !is_invalid_reserved_pfn(*pfn_base)) { ret = vfio_lock_acct(dma, 1, true); if (ret) { From patchwork Mon Nov 5 16:55:55 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Daniel Jordan X-Patchwork-Id: 10668663 
From: Daniel Jordan To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com,
Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz Subject: [RFC PATCH v4 10/13] mm: enlarge type of offset argument in mem_map_offset and mem_map_next Date: Mon, 5 Nov 2018 11:55:55 -0500 Message-Id: <20181105165558.11698-11-daniel.m.jordan@oracle.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com> References: <20181105165558.11698-1-daniel.m.jordan@oracle.com> MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9068 signatures=668683 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=0 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=778 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1807170000 definitions=main-1811050153 X-Bogosity: Ham, tests=bogofilter, spamicity=0.014894, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: X-Virus-Scanned: ClamAV using ClamSMTP Changes the type of 'offset' from int to unsigned long in both mem_map_offset and mem_map_next. This facilitates ktask's use of mem_map_next with its unsigned long types to avoid silent truncation when these unsigned longs are passed as ints. It also fixes the preexisting truncation of 'offset' from unsigned long to int by the sole caller of mem_map_offset, follow_hugetlb_page. Signed-off-by: Daniel Jordan --- mm/internal.h | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 3b1ec1412fd2..cc90de4d4e01 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -367,7 +367,8 @@ static inline void mlock_migrate_page(struct page *new, struct page *old) { } * the maximally aligned gigantic page 'base'. Handle any discontiguity * in the mem_map at MAX_ORDER_NR_PAGES boundaries. */ -static inline struct page *mem_map_offset(struct page *base, int offset) +static inline struct page *mem_map_offset(struct page *base, + unsigned long offset) { if (unlikely(offset >= MAX_ORDER_NR_PAGES)) return nth_page(base, offset); @@ -378,8 +379,8 @@ static inline struct page *mem_map_offset(struct page *base, int offset) * Iterator over all subpages within the maximally aligned gigantic * page 'base'. Handle any discontiguity in the mem_map. 
*/ -static inline struct page *mem_map_next(struct page *iter, - struct page *base, int offset) +static inline struct page *mem_map_next(struct page *iter, struct page *base, + unsigned long offset) { if (unlikely((offset & (MAX_ORDER_NR_PAGES - 1)) == 0)) { unsigned long pfn = page_to_pfn(base) + offset; From patchwork Mon Nov 5 16:55:56 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Daniel Jordan X-Patchwork-Id: 10668675
Nov 2018 08:56:25 -0800 From: Daniel Jordan To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node Date: Mon, 5 Nov 2018 11:55:56 -0500 Message-Id: <20181105165558.11698-12-daniel.m.jordan@oracle.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com> References: <20181105165558.11698-1-daniel.m.jordan@oracle.com> MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9068 signatures=668683 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=2 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1807170000 definitions=main-1811050153 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: X-Virus-Scanned: ClamAV using ClamSMTP Deferred struct page initialization currently runs one thread per node, but this is a bottleneck during boot on big machines, so use ktask within each pgdatinit thread to parallelize the struct page initialization, allowing the system to take better advantage of its memory bandwidth. Because the system is not fully up yet and most CPUs are idle, use more than the default maximum number of ktask threads. The kernel doesn't know the memory bandwidth of a given system to get the most efficient number of threads, so there's some guesswork involved. In testing, a reasonable value turned out to be about a quarter of the CPUs on the node. __free_pages_core used to increase the zone's managed page count by the number of pages being freed. To accommodate multiple threads, however, account the number of freed pages with an atomic shared across the ktask threads and bump the managed page count with it after ktask is finished. 
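To make that accounting scheme concrete, here is a minimal userspace sketch of the same pattern (illustrative only, not part of the patch; pthreads stand in for ktask workers, and the thread count, range size, and names such as init_chunk and chunk_args are invented). Each worker processes one chunk of a range, adds its per-chunk total to a shared atomic counter, and the total is applied once after all workers finish, the way the zone's managed page count is bumped after ktask completes.

/*
 * Userspace analogue of the shared-atomic accounting described above.
 * Build with: cc -pthread example.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS   4
#define TOTAL_PFNS 1000000UL

struct chunk_args {
	unsigned long start;     /* first pfn of this chunk */
	unsigned long end;       /* one past the last pfn of this chunk */
	atomic_ulong *nr_pages;  /* shared counter, like deferred_args.nr_pages */
};

static void *init_chunk(void *data)
{
	struct chunk_args *args = data;
	unsigned long nr = 0;

	/* Stand-in for initializing/freeing the pages of one chunk. */
	for (unsigned long pfn = args->start; pfn < args->end; pfn++)
		nr++;

	/* Publish this chunk's count; no other shared state is touched. */
	atomic_fetch_add(args->nr_pages, nr);
	return NULL;
}

int main(void)
{
	pthread_t threads[NTHREADS];
	struct chunk_args args[NTHREADS];
	atomic_ulong nr_pages = 0;
	unsigned long chunk = TOTAL_PFNS / NTHREADS;

	for (int i = 0; i < NTHREADS; i++) {
		args[i].start = i * chunk;
		args[i].end = (i == NTHREADS - 1) ? TOTAL_PFNS : (i + 1) * chunk;
		args[i].nr_pages = &nr_pages;
		pthread_create(&threads[i], NULL, init_chunk, &args[i]);
	}
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);

	/* Apply the total once, after every worker has finished. */
	printf("managed_pages += %lu\n", atomic_load(&nr_pages));
	return 0;
}

As in the patch, per-chunk totals are combined through a single atomic, so workers never serialize on updating the final managed-page count.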
Test: Boot the machine with deferred struct page init three times Machine: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, 88 CPUs, 503G memory, 2 sockets kernel speedup max time per stdev node (ms) baseline (4.15-rc2) 5860 8.6 ktask 9.56x 613 12.4 --- Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory 8 sockets kernel speedup max time per stdev node (ms) baseline (4.15-rc2) 1261 1.9 ktask 3.88x 325 5.0 Signed-off-by: Daniel Jordan Suggested-by: Pavel Tatashin --- mm/page_alloc.c | 91 ++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 78 insertions(+), 13 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ae31839874b8..fe7b681567ba 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -66,6 +66,7 @@ #include #include #include +#include #include #include @@ -1275,7 +1276,6 @@ void __free_pages_core(struct page *page, unsigned int order) set_page_count(p, 0); } - page_zone(page)->managed_pages += nr_pages; set_page_refcounted(page); __free_pages(page, order); } @@ -1340,6 +1340,7 @@ void __init memblock_free_pages(struct page *page, unsigned long pfn, if (early_page_uninitialised(pfn)) return; __free_pages_core(page, order); + page_zone(page)->managed_pages += 1UL << order; } /* @@ -1477,23 +1478,31 @@ deferred_pfn_valid(int nid, unsigned long pfn, return true; } +struct deferred_args { + int nid; + int zid; + atomic64_t nr_pages; +}; + /* * Free pages to buddy allocator. Try to free aligned pages in * pageblock_nr_pages sizes. */ -static void __init deferred_free_pages(int nid, int zid, unsigned long pfn, - unsigned long end_pfn) +static int __init deferred_free_pages(int nid, int zid, unsigned long pfn, + unsigned long end_pfn) { struct mminit_pfnnid_cache nid_init_state = { }; unsigned long nr_pgmask = pageblock_nr_pages - 1; - unsigned long nr_free = 0; + unsigned long nr_free = 0, nr_pages = 0; for (; pfn < end_pfn; pfn++) { if (!deferred_pfn_valid(nid, pfn, &nid_init_state)) { deferred_free_range(pfn - nr_free, nr_free); + nr_pages += nr_free; nr_free = 0; } else if (!(pfn & nr_pgmask)) { deferred_free_range(pfn - nr_free, nr_free); + nr_pages += nr_free; nr_free = 1; touch_nmi_watchdog(); } else { @@ -1502,16 +1511,27 @@ static void __init deferred_free_pages(int nid, int zid, unsigned long pfn, } /* Free the last block of pages to allocator */ deferred_free_range(pfn - nr_free, nr_free); + nr_pages += nr_free; + + return nr_pages; +} + +static int __init deferred_free_chunk(unsigned long pfn, unsigned long end_pfn, + struct deferred_args *args) +{ + unsigned long nr_pages = deferred_free_pages(args->nid, args->zid, pfn, + end_pfn); + atomic64_add(nr_pages, &args->nr_pages); + return KTASK_RETURN_SUCCESS; } /* * Initialize struct pages. We minimize pfn page lookups and scheduler checks * by performing it only once every pageblock_nr_pages. - * Return number of pages initialized. + * Return number of pages initialized in deferred_args. 
*/ -static unsigned long __init deferred_init_pages(int nid, int zid, - unsigned long pfn, - unsigned long end_pfn) +static int __init deferred_init_pages(int nid, int zid, unsigned long pfn, + unsigned long end_pfn) { struct mminit_pfnnid_cache nid_init_state = { }; unsigned long nr_pgmask = pageblock_nr_pages - 1; @@ -1531,7 +1551,17 @@ static unsigned long __init deferred_init_pages(int nid, int zid, __init_single_page(page, pfn, zid, nid); nr_pages++; } - return (nr_pages); + + return nr_pages; +} + +static int __init deferred_init_chunk(unsigned long pfn, unsigned long end_pfn, + struct deferred_args *args) +{ + unsigned long nr_pages = deferred_init_pages(args->nid, args->zid, pfn, + end_pfn); + atomic64_add(nr_pages, &args->nr_pages); + return KTASK_RETURN_SUCCESS; } /* Initialise remaining memory on a node */ @@ -1540,13 +1570,15 @@ static int __init deferred_init_memmap(void *data) pg_data_t *pgdat = data; int nid = pgdat->node_id; unsigned long start = jiffies; - unsigned long nr_pages = 0; + unsigned long nr_init = 0, nr_free = 0; unsigned long spfn, epfn, first_init_pfn, flags; phys_addr_t spa, epa; int zid; struct zone *zone; const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id); u64 i; + unsigned long nr_node_cpus; + struct ktask_node kn; /* Bind memory initialisation thread to a local node if possible */ if (!cpumask_empty(cpumask)) @@ -1560,6 +1592,14 @@ static int __init deferred_init_memmap(void *data) return 0; } + /* + * We'd like to know the memory bandwidth of the chip to calculate the + * most efficient number of threads to start, but we can't. In + * testing, a good value for a variety of systems was a quarter of the + * CPUs on the node. + */ + nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4); + /* Sanity check boundaries */ BUG_ON(pgdat->first_deferred_pfn < pgdat->node_start_pfn); BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat)); @@ -1580,21 +1620,46 @@ static int __init deferred_init_memmap(void *data) * page in __free_one_page()). */ for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) { + struct deferred_args args = { nid, zid, ATOMIC64_INIT(0) }; + DEFINE_KTASK_CTL(ctl, deferred_init_chunk, &args, + KTASK_PTE_MINCHUNK); + ktask_ctl_set_max_threads(&ctl, nr_node_cpus); + spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa)); epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa)); - nr_pages += deferred_init_pages(nid, zid, spfn, epfn); + + kn.kn_start = (void *)spfn; + kn.kn_task_size = (spfn < epfn) ? epfn - spfn : 0; + kn.kn_nid = nid; + (void) ktask_run_numa(&kn, 1, &ctl); + + nr_init += atomic64_read(&args.nr_pages); } for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) { + struct deferred_args args = { nid, zid, ATOMIC64_INIT(0) }; + DEFINE_KTASK_CTL(ctl, deferred_free_chunk, &args, + KTASK_PTE_MINCHUNK); + ktask_ctl_set_max_threads(&ctl, nr_node_cpus); + spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa)); epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa)); - deferred_free_pages(nid, zid, spfn, epfn); + + kn.kn_start = (void *)spfn; + kn.kn_task_size = (spfn < epfn) ? 
epfn - spfn : 0; + kn.kn_nid = nid; + (void) ktask_run_numa(&kn, 1, &ctl); + + nr_free += atomic64_read(&args.nr_pages); } pgdat_resize_unlock(pgdat, &flags); /* Sanity check that the next zone really is unpopulated */ WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone)); + VM_BUG_ON(nr_init != nr_free); + + zone->managed_pages += nr_free; - pr_info("node %d initialised, %lu pages in %ums\n", nid, nr_pages, + pr_info("node %d initialised, %lu pages in %ums\n", nid, nr_free, jiffies_to_msecs(jiffies - start)); pgdat_init_report_one_done(); From patchwork Mon Nov 5 16:55:57 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Daniel Jordan X-Patchwork-Id: 10668671
From: Daniel Jordan
To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz
Subject: [RFC PATCH v4 12/13] mm: parallelize clear_gigantic_page
Date: Mon, 5 Nov 2018 11:55:57 -0500
Message-Id: <20181105165558.11698-13-daniel.m.jordan@oracle.com>
In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com>
References: <20181105165558.11698-1-daniel.m.jordan@oracle.com>

Parallelize clear_gigantic_page, which zeroes any page size larger than 8M (e.g. 1G on x86).

Performance results (the default number of threads is 4; higher thread counts are shown for context only):

Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory
Test: Clear a range of gigantic pages (triggered via fallocate)

nthread  speedup  size (GiB)  min time (s)  stdev
      1                  100         41.13   0.03
      2    2.03x         100         20.26   0.14
      4    4.28x         100          9.62   0.09
      8    8.39x         100          4.90   0.05
     16   10.44x         100          3.94   0.03
      1                  200         89.68   0.35
      2    2.21x         200         40.64   0.18
      4    4.64x         200         19.33   0.32
      8    8.99x         200          9.98   0.04
     16   11.27x         200          7.96   0.04
      1                  400        188.20   1.57
      2    2.30x         400         81.84   0.09
      4    4.63x         400         40.62   0.26
      8    8.92x         400         21.09   0.50
     16   11.78x         400         15.97   0.25
      1                  800        434.91   1.81
      2    2.54x         800        170.97   1.46
      4    4.98x         800         87.38   1.91
      8   10.15x         800         42.86   2.59
     16   12.99x         800         33.48   0.83

The speedups are mostly due to the fact that more threads can use more memory bandwidth. The loop we're stressing on the x86 chip in this test is clear_page_erms, which tops out at a bandwidth of 2550 MiB/s with one thread. We get the same bandwidth per thread for 2, 4, or 8 threads, but at 16 threads the per-thread bandwidth drops to 1420 MiB/s. Performance still improves over a single thread, though, because of the ktask threads' NUMA awareness (ktask migrates worker threads to the node local to the work being done). This becomes a bigger factor as the number of pages to zero grows to span memory from multiple nodes, so speedups increase with size.
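To show the chunking idea in isolation, here is a small userspace sketch (illustrative only, not part of the patch; the buffer size, thread count, and names such as zero_chunk are invented) that zeroes one large buffer in per-thread chunks, the same way clear_gigantic_page is split into chunks above so that more than one core's worth of memory bandwidth is used.

/*
 * Userspace analogue of chunked zeroing.  Build with: cc -pthread example.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 4
#define BUF_SIZE (1UL << 30)	/* 1 GiB, the x86 gigantic page size */

struct zero_args {
	char *start;	/* first byte of this thread's chunk */
	size_t len;	/* bytes to zero in this chunk */
};

static void *zero_chunk(void *data)
{
	struct zero_args *args = data;

	/* Stand-in for the clear_user_highpage() loop over one chunk. */
	memset(args->start, 0, args->len);
	return NULL;
}

int main(void)
{
	char *buf = malloc(BUF_SIZE);
	pthread_t threads[NTHREADS];
	struct zero_args args[NTHREADS];
	size_t chunk = BUF_SIZE / NTHREADS;

	if (!buf)
		return 1;

	for (int i = 0; i < NTHREADS; i++) {
		args[i].start = buf + (size_t)i * chunk;
		args[i].len = (i == NTHREADS - 1) ? BUF_SIZE - (size_t)i * chunk : chunk;
		pthread_create(&threads[i], NULL, zero_chunk, &args[i]);
	}
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);

	printf("zeroed %lu bytes with %d threads\n", BUF_SIZE, NTHREADS);
	free(buf);
	return 0;
}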
Signed-off-by: Daniel Jordan --- mm/memory.c | 32 ++++++++++++++++++++++++-------- 1 file changed, 24 insertions(+), 8 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 15c417e8e31d..445d06537905 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -69,6 +69,7 @@ #include #include #include +#include #include #include @@ -4415,19 +4416,28 @@ static inline void process_huge_page( } } -static void clear_gigantic_page(struct page *page, - unsigned long addr, - unsigned int pages_per_huge_page) +struct cgp_args { + struct page *base_page; + unsigned long addr; +}; + +static int clear_gigantic_page_chunk(unsigned long start, unsigned long end, + struct cgp_args *args) { - int i; - struct page *p = page; + struct page *base_page = args->base_page; + struct page *p = base_page; + unsigned long addr = args->addr; + unsigned long i; might_sleep(); - for (i = 0; i < pages_per_huge_page; - i++, p = mem_map_next(p, page, i)) { + for (i = start; i < end; ++i) { cond_resched(); clear_user_highpage(p, addr + i * PAGE_SIZE); + + p = mem_map_next(p, base_page, i); } + + return KTASK_RETURN_SUCCESS; } static void clear_subpage(unsigned long addr, int idx, void *arg) @@ -4444,7 +4454,13 @@ void clear_huge_page(struct page *page, ~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1); if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) { - clear_gigantic_page(page, addr, pages_per_huge_page); + struct cgp_args args = {page, addr}; + struct ktask_node node = {0, pages_per_huge_page, + page_to_nid(page)}; + DEFINE_KTASK_CTL(ctl, clear_gigantic_page_chunk, &args, + KTASK_MEM_CHUNK); + + ktask_run_numa(&node, 1, &ctl); return; } From patchwork Mon Nov 5 16:55:58 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Daniel Jordan X-Patchwork-Id: 10668687 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8B2E614E2 for ; Mon, 5 Nov 2018 16:57:33 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 78A1F28A75 for ; Mon, 5 Nov 2018 16:57:33 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 6C80829463; Mon, 5 Nov 2018 16:57:33 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,RCVD_IN_DNSWL_NONE, UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 9F8AE28A75 for ; Mon, 5 Nov 2018 16:57:32 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 64B046B027B; Mon, 5 Nov 2018 11:56:52 -0500 (EST) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 624C96B027D; Mon, 5 Nov 2018 11:56:52 -0500 (EST) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 4C6456B027E; Mon, 5 Nov 2018 11:56:52 -0500 (EST) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from mail-yw1-f72.google.com (mail-yw1-f72.google.com [209.85.161.72]) by kanga.kvack.org (Postfix) with ESMTP id 190FD6B027B for ; Mon, 5 Nov 2018 11:56:52 -0500 (EST) Received: by mail-yw1-f72.google.com with SMTP id 
From: Daniel Jordan To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com, darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org, mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com, rdunlap@infradead.org, steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz Subject: [RFC PATCH v4 13/13] hugetlbfs: parallelize hugetlbfs_fallocate with ktask Date: Mon, 5 Nov 2018 11:55:58 -0500 Message-Id: <20181105165558.11698-14-daniel.m.jordan@oracle.com> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com> References: <20181105165558.11698-1-daniel.m.jordan@oracle.com> MIME-Version: 1.0
spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: X-Virus-Scanned: ClamAV using ClamSMTP hugetlbfs_fallocate preallocates huge pages to back a file in a hugetlbfs filesystem. The time to call this function grows linearly with size. ktask performs well with its default thread count of 4; higher thread counts are given for context only. Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory Test: fallocate(1) a file on a hugetlbfs filesystem nthread speedup size (GiB) min time (s) stdev 1 200 127.53 2.19 2 3.09x 200 41.30 2.11 4 5.72x 200 22.29 0.51 8 9.45x 200 13.50 2.58 16 9.74x 200 13.09 1.64 1 400 193.09 2.47 2 2.14x 400 90.31 3.39 4 3.84x 400 50.32 0.44 8 5.11x 400 37.75 1.23 16 6.12x 400 31.54 3.13 The primary bottleneck for better scaling at higher thread counts is hugetlb_fault_mutex_table[hash]. perf showed L1-dcache-loads increase with 8 threads and again sharply with 16 threads, and a CPU counter profile showed that 31% of the L1d misses were on hugetlb_fault_mutex_table[hash] in the 16-thread case. Signed-off-by: Daniel Jordan --- fs/hugetlbfs/inode.c | 114 +++++++++++++++++++++++++++++++++++-------- 1 file changed, 93 insertions(+), 21 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 762028994f47..a73548a96061 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -37,6 +37,7 @@ #include #include #include +#include #include @@ -104,11 +105,16 @@ static const struct fs_parameter_description hugetlb_fs_parameters = { }; #ifdef CONFIG_NUMA +static inline struct shared_policy *hugetlb_get_shared_policy( + struct inode *inode) +{ + return &HUGETLBFS_I(inode)->policy; +} + static inline void hugetlb_set_vma_policy(struct vm_area_struct *vma, - struct inode *inode, pgoff_t index) + struct shared_policy *policy, pgoff_t index) { - vma->vm_policy = mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy, - index); + vma->vm_policy = mpol_shared_policy_lookup(policy, index); } static inline void hugetlb_drop_vma_policy(struct vm_area_struct *vma) @@ -116,8 +122,14 @@ static inline void hugetlb_drop_vma_policy(struct vm_area_struct *vma) mpol_cond_put(vma->vm_policy); } #else +static inline struct shared_policy *hugetlb_get_shared_policy( + struct inode *inode) +{ + return NULL; +} + static inline void hugetlb_set_vma_policy(struct vm_area_struct *vma, - struct inode *inode, pgoff_t index) + struct shared_policy *policy, pgoff_t index) { } @@ -576,20 +588,30 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) return 0; } +struct hf_args { + struct file *file; + struct task_struct *parent_task; + struct mm_struct *mm; + struct shared_policy *shared_policy; + struct hstate *hstate; + struct address_space *mapping; + int error; +}; + +static int hugetlbfs_fallocate_chunk(pgoff_t start, pgoff_t end, + struct hf_args *args); + static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len) { struct inode *inode = file_inode(file); struct hugetlbfs_inode_info *info = HUGETLBFS_I(inode); - struct address_space *mapping = inode->i_mapping; struct hstate *h = hstate_inode(inode); - struct vm_area_struct pseudo_vma; - struct mm_struct *mm = current->mm; loff_t hpage_size = huge_page_size(h); unsigned long hpage_shift = huge_page_shift(h); - pgoff_t start, index, end; + pgoff_t start, end; + struct hf_args hf_args; int error; - u32 hash; if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) return -EOPNOTSUPP; @@ 
-617,16 +639,66 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, goto out; } + hf_args.file = file; + hf_args.parent_task = current; + hf_args.mm = current->mm; + hf_args.shared_policy = hugetlb_get_shared_policy(inode); + hf_args.hstate = h; + hf_args.mapping = inode->i_mapping; + hf_args.error = 0; + + if (unlikely(hstate_is_gigantic(h))) { + /* + * Use multiple threads in clear_gigantic_page instead of here, + * so just do a 1-threaded hugetlbfs_fallocate_chunk. + */ + error = hugetlbfs_fallocate_chunk(start, end, &hf_args); + } else { + DEFINE_KTASK_CTL(ctl, hugetlbfs_fallocate_chunk, + &hf_args, KTASK_PMD_MINCHUNK); + + error = ktask_run((void *)start, end - start, &ctl); + } + + if (error != KTASK_RETURN_SUCCESS && hf_args.error != -EINTR) + goto out; + + if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) + i_size_write(inode, offset + len); + inode->i_ctime = current_time(inode); +out: + inode_unlock(inode); + return error; +} + +static int hugetlbfs_fallocate_chunk(pgoff_t start, pgoff_t end, + struct hf_args *args) +{ + struct file *file = args->file; + struct task_struct *parent_task = args->parent_task; + struct mm_struct *mm = args->mm; + struct shared_policy *shared_policy = args->shared_policy; + struct hstate *h = args->hstate; + struct address_space *mapping = args->mapping; + int error = 0; + pgoff_t index; + struct vm_area_struct pseudo_vma; + loff_t hpage_size; + u32 hash; + + hpage_size = huge_page_size(h); + /* * Initialize a pseudo vma as this is required by the huge page * allocation routines. If NUMA is configured, use page index - * as input to create an allocation policy. + * as input to create an allocation policy. Each thread gets its + * own pseudo vma because mempolicies can differ by page. */ vma_init(&pseudo_vma, mm); pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED); pseudo_vma.vm_file = file; - for (index = start; index < end; index++) { + for (index = start; index < end; ++index) { /* * This is supposed to be the vaddr where the page is being * faulted in, but we have no vaddr here. @@ -641,13 +713,13 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, * fallocate(2) manpage permits EINTR; we may have been * interrupted because we are using up too much memory. 
*/ - if (signal_pending(current)) { + if (signal_pending(parent_task) || signal_pending(current)) { error = -EINTR; - break; + goto err; } /* Set numa allocation policy based on index */ - hugetlb_set_vma_policy(&pseudo_vma, inode, index); + hugetlb_set_vma_policy(&pseudo_vma, shared_policy, index); /* addr is the offset within the file (zero based) */ addr = index * hpage_size; @@ -672,7 +744,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, if (IS_ERR(page)) { mutex_unlock(&hugetlb_fault_mutex_table[hash]); error = PTR_ERR(page); - goto out; + goto err; } clear_huge_page(page, addr, pages_per_huge_page(h)); __SetPageUptodate(page); @@ -680,7 +752,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, if (unlikely(error)) { put_page(page); mutex_unlock(&hugetlb_fault_mutex_table[hash]); - goto out; + goto err; } mutex_unlock(&hugetlb_fault_mutex_table[hash]); @@ -693,11 +765,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, put_page(page); } - if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) - i_size_write(inode, offset + len); - inode->i_ctime = current_time(inode); -out: - inode_unlock(inode); + return KTASK_RETURN_SUCCESS; + +err: + args->error = error; + return error; }
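For readers following the error handling in hugetlbfs_fallocate_chunk, here is a userspace sketch of the same cancellation-and-error-propagation pattern (illustrative only, not part of the patch; pthreads stand in for ktask workers, an atomic cancel flag stands in for signal_pending() on the parent task, and names such as falloc_args and preallocate_one are invented). Each worker stops early on cancellation or error and records the result in the shared args so the parent can report it after all chunks finish.

/*
 * Userspace analogue of per-chunk cancellation and error reporting.
 * Build with: cc -pthread example.c
 */
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NTHREADS 4
#define NR_ITEMS 1024L

struct falloc_args {
	atomic_bool cancel;	/* analogue of a signal pending on the parent */
	atomic_int error;	/* first error seen by any worker, 0 if none */
};

struct chunk {
	long start, end;	/* [start, end) indexes handled by this worker */
	struct falloc_args *args;
};

static int preallocate_one(long idx)
{
	(void)idx;
	return 0;	/* stand-in for allocating and adding one huge page */
}

static void *fallocate_chunk(void *data)
{
	struct chunk *c = data;

	for (long i = c->start; i < c->end; i++) {
		if (atomic_load(&c->args->cancel)) {
			atomic_store(&c->args->error, -EINTR);
			return NULL;
		}
		int err = preallocate_one(i);
		if (err) {
			atomic_store(&c->args->error, err);
			return NULL;
		}
	}
	return NULL;
}

int main(void)
{
	struct falloc_args args = { .cancel = false, .error = 0 };
	struct chunk chunks[NTHREADS];
	pthread_t threads[NTHREADS];
	long per = NR_ITEMS / NTHREADS;

	for (int i = 0; i < NTHREADS; i++) {
		chunks[i] = (struct chunk){ i * per,
			(i == NTHREADS - 1) ? NR_ITEMS : (i + 1) * per, &args };
		pthread_create(&threads[i], NULL, fallocate_chunk, &chunks[i]);
	}
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);

	printf("error = %d\n", atomic_load(&args.error));
	return 0;
}

As in the patch, the error is stashed in the shared args rather than returned through the worker itself, so the caller can distinguish an interrupted run (-EINTR) from other failures once every chunk has stopped.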