From: Daniel Jordan
To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org,
    alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com,
    darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com,
    jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org,
    mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com,
    prasad.singamsetty@oracle.com, rdunlap@infradead.org,
    steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org,
    vbabka@suse.cz
Subject: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work
Date: Mon, 5 Nov 2018 11:55:45 -0500
Message-Id: <20181105165558.11698-1-daniel.m.jordan@oracle.com>
X-Patchwork-Id: 10668661

Hi,

This version addresses some of the feedback from Andrew and Michal last year
and describes the plan for tackling the rest.  I'm posting now since I'll be
presenting ktask at Plumbers next week.

Andrew, you asked about parallelizing in more places[0].  This version adds
multithreading for VFIO page pinning, and there are more planned users listed
below.

Michal, you mentioned that ktask should be sensitive to CPU utilization[1].
ktask threads now run at the lowest priority on the system to avoid disturbing
busy CPUs (more details in patches 4 and 5).  Does this address your concern?
The plan to address your other comments is explained below.

Alex, any thoughts about the VFIO changes in patches 6-9?

Tejun and Lai, what do you think of patch 5?

And for everyone, questions and comments welcome.  Any suggestions for more
users?

Thanks,
Daniel

P.S.  This series is big to address the above feedback, but I can send patches
7 and 8 separately.


TODO
----

 - Implement cgroup-aware unbound workqueues in a separate series, picking up
   Bandan Das's effort from two years ago[2].  This should hopefully address
   Michal's comment about running ktask threads within the limits of the
   calling context[1].

 - Make ktask aware of power management.  A starting point is to disable the
   framework when energy-conscious cpufreq settings are enabled (e.g.
   powersave, conservative scaling governors).  This should address another
   comment from Michal about keeping CPUs under power constraints idle[1].

 - Add more users.  On my list:
    - __ib_umem_release in IB core, which Jason Gunthorpe mentioned[3]
    - XFS quotacheck and online repair, as suggested by Darrick Wong
    - vfs object teardown at umount time, as Andrew mentioned[0]
    - page freeing in munmap/exit, as Aaron Lu posted[4]
    - page freeing in shmem
   The last three will benefit from scaling zone->lock and lru_lock.

 - CPU hotplug support for ktask to adjust its per-CPU data and resource
   limits.

 - Check with IOMMU folks that iommu_map is safe for all IOMMU backend
   implementations (it is for x86).


Summary
-------

A single CPU can spend an excessive amount of time in the kernel operating on
large amounts of data.  Often these situations arise during initialization- and
destruction-related tasks, where the data involved scales with system size.
These long-running jobs can slow startup and shutdown of applications and the
system itself while extra CPUs sit idle.

To ensure that applications and the kernel continue to perform well as core
counts and memory sizes increase, harness these idle CPUs to complete such jobs
more quickly.

ktask is a generic framework for parallelizing CPU-intensive work in the
kernel.  The API is generic enough to add concurrency to many different kinds
of tasks--for example, zeroing a range of pages or evicting a list of
inodes--and aims to save its clients the trouble of splitting up the work,
choosing the number of threads to use, maintaining an efficient concurrency
level, starting these threads, and load balancing the work between them.

The first patch has more documentation, and the second patch has the interface.

Current users:
 1) VFIO page pinning before kvm guest startup (others hitting slowness too[5])
 2) deferred struct page initialization at boot time
 3) clearing gigantic pages
 4) fallocate for HugeTLB pages

This patchset is based on the 2018-10-30 head of mmotm/master.
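
For illustration only, the duties the Summary says ktask takes off its clients
(splitting the work, starting threads, load balancing) can be sketched in
userspace C with pthreads.  This is not the ktask interface, which is in patch
2; struct job, worker(), zero_parallel() and MAX_THREADS are hypothetical
names, and having the calling thread take part in the work is a choice made
for this sketch, not something stated in this cover letter.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define MAX_THREADS 8	/* hypothetical cap on concurrency */

/* Hypothetical job description; NOT the ktask interface. */
struct job {
	unsigned char *start;	/* beginning of the range to zero */
	size_t size;		/* total bytes in the range */
	size_t chunk;		/* bytes handed out per claim */
	size_t next;		/* next unclaimed offset, protected by lock */
	pthread_mutex_t lock;
};

/* Claim chunks until none remain; faster threads simply claim more. */
static void *worker(void *arg)
{
	struct job *j = arg;

	for (;;) {
		size_t off, len;

		pthread_mutex_lock(&j->lock);
		off = j->next;
		len = j->size - off;
		if (len > j->chunk)
			len = j->chunk;
		j->next = off + len;
		pthread_mutex_unlock(&j->lock);

		if (len == 0)
			return NULL;
		memset(j->start + off, 0, len);
	}
}

/* Zero buf using up to nthreads threads, counting the caller as one. */
static void zero_parallel(void *buf, size_t size, size_t chunk, int nthreads)
{
	struct job j = { .start = buf, .size = size,
			 .chunk = chunk ? chunk : 1, .next = 0 };
	pthread_t tid[MAX_THREADS];
	int started = 0;

	assert(buf != NULL || size == 0);
	if (nthreads < 1)
		nthreads = 1;
	if (nthreads > MAX_THREADS)
		nthreads = MAX_THREADS;

	pthread_mutex_init(&j.lock, NULL);
	for (int i = 0; i < nthreads - 1; i++) {
		if (pthread_create(&tid[started], NULL, worker, &j) != 0)
			break;	/* fall back to fewer helpers */
		started++;
	}
	worker(&j);	/* the caller works too, so the job always finishes */
	for (int i = 0; i < started; i++)
		pthread_join(tid[i], NULL);
	pthread_mutex_destroy(&j.lock);
}
```

A caller would use it as zero_parallel(buf, len, 64 * 1024, 4).  Handing out
small chunks from a shared cursor, rather than giving each thread a fixed
slice up front, is what provides the load balancing: a thread that gets
delayed just claims fewer chunks.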

Changelog:

v3 -> v4:
 - Added VFIO page pinning use case (Andrew's "more users" comment)
 - Made ktask helpers run at the lowest priority on the system (Michal's
   concern about sensitivity to CPU utilization)
 - ktask learned to undo part of a task on error (required for VFIO)
 - Changed mm->locked_vm to an atomic to improve performance for VFIO.  This
   can be split out into a smaller series (there wasn't time before posting
   this)
 - Removed /proc/sys/debug/ktask_max_threads (it was a debug-only thing)
 - Some minor improvements in the ktask code itself (shorter, cleaner, etc)
 - Updated Documentation to cover these changes

v2 -> v3:
 - Changed cpu to CPU in the ktask Documentation, as suggested by Randy Dunlap
 - Saved more boot time now that Pavel Tatashin's deferred struct page init
   patches are in mainline (https://lkml.org/lkml/2017/10/13/692).  New
   performance results in patch 7.
 - Added resource limits, per-node and system-wide, to maintain efficient
   concurrency levels (addresses a concern from my Plumbers talk)
 - ktask no longer allocates memory internally during a task so it can be used
   in sensitive contexts
 - Added the option to run work anywhere on the system rather than always
   confining it to a specific node
 - Updated Documentation patch with these changes and reworked motivation
   section

v1 -> v2:
 - Added deferred struct page initialization use case.
 - Explained the source of the performance improvement from parallelizing
   clear_gigantic_page (comment from Dave Hansen).
 - Fixed Documentation and build warnings from CONFIG_KTASK=n kernels.
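
As a rough userspace analogy for the v4 locked_vm item, a counter that several
threads charge at once can be kept as an atomic and checked against a limit
without taking a lock.  The sketch below uses C11 atomics and one common
pattern (add optimistically, back out on overflow); struct account and both
helpers are hypothetical, and the actual kernel change in patch 7 may well be
structured differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a per-process locked-page counter. */
struct account {
	atomic_long locked;	/* pages currently charged */
	long limit;		/* maximum pages that may be charged */
};

/* Try to charge npages; return false and leave the count unchanged if
 * the limit would be exceeded.  No lock is taken. */
static bool account_charge(struct account *a, long npages)
{
	long now;

	assert(npages >= 0);
	/* atomic_fetch_add() returns the old value, so add npages back in. */
	now = atomic_fetch_add(&a->locked, npages) + npages;
	if (now > a->limit) {
		atomic_fetch_sub(&a->locked, npages);	/* back out */
		return false;
	}
	return true;
}

/* Release a charge previously taken with account_charge(). */
static void account_uncharge(struct account *a, long npages)
{
	assert(npages >= 0);
	atomic_fetch_sub(&a->locked, npages);
}
```

One consequence of the optimistic add is that a concurrent charge can fail
spuriously while another thread's oversized attempt is being backed out; a
compare-and-swap loop avoids that at the cost of retries.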

ktask v3 RFC: lkml.kernel.org/r/20171205195220.28208-1-daniel.m.jordan@oracle.com

[0] https://lkml.kernel.org/r/20171205142300.67489b1a90605e1089c5aaa9@linux-foundation.org
[1] https://lkml.kernel.org/r/20171206143509.GG7515@dhcp22.suse.cz
[2] https://lkml.kernel.org/r/1458339291-4093-1-git-send-email-bsd@redhat.com
[3] https://lkml.kernel.org/r/20180928153922.GA17076@ziepe.ca
[4] https://lkml.kernel.org/r/1489568404-7817-1-git-send-email-aaron.lu@intel.com
[5] https://www.redhat.com/archives/vfio-users/2018-April/msg00020.html

Daniel Jordan (13):
  ktask: add documentation
  ktask: multithread CPU-intensive kernel work
  ktask: add undo support
  ktask: run helper threads at MAX_NICE
  workqueue, ktask: renice helper threads to prevent starvation
  vfio: parallelize vfio_pin_map_dma
  mm: change locked_vm's type from unsigned long to atomic_long_t
  vfio: remove unnecessary mmap_sem writer acquisition around locked_vm
  vfio: relieve mmap_sem reader cacheline bouncing by holding it longer
  mm: enlarge type of offset argument in mem_map_offset and mem_map_next
  mm: parallelize deferred struct page initialization within each node
  mm: parallelize clear_gigantic_page
  hugetlbfs: parallelize hugetlbfs_fallocate with ktask

 Documentation/core-api/index.rst    |   1 +
 Documentation/core-api/ktask.rst    | 213 +++++++++
 arch/powerpc/kvm/book3s_64_vio.c    |  15 +-
 arch/powerpc/mm/mmu_context_iommu.c |  16 +-
 drivers/fpga/dfl-afu-dma-region.c   |  16 +-
 drivers/vfio/vfio_iommu_spapr_tce.c |  14 +-
 drivers/vfio/vfio_iommu_type1.c     | 159 ++++---
 fs/hugetlbfs/inode.c                | 114 ++++-
 fs/proc/task_mmu.c                  |   2 +-
 include/linux/ktask.h               | 267 ++++++++++++
 include/linux/mm_types.h            |   2 +-
 include/linux/workqueue.h           |   5 +
 init/Kconfig                        |  11 +
 init/main.c                         |   2 +
 kernel/Makefile                     |   2 +-
 kernel/fork.c                       |   2 +-
 kernel/ktask.c                      | 646 ++++++++++++++++++++++++++++
 kernel/workqueue.c                  | 106 ++++-
 mm/debug.c                          |   3 +-
 mm/internal.h                       |   7 +-
 mm/memory.c                         |  32 +-
 mm/mlock.c                          |   4 +-
 mm/mmap.c                           |  18 +-
 mm/mremap.c                         |   6 +-
 mm/page_alloc.c                     |  91 +++-
 25 files changed, 1599 insertions(+), 155 deletions(-)
 create mode 100644 Documentation/core-api/ktask.rst
 create mode 100644 include/linux/ktask.h
 create mode 100644 kernel/ktask.c