From: Peter Xu
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Anish Moorthy, Axel Rasmussen, Alexander Viro, Mike Kravetz,
    Peter Zijlstra, Andrew Morton, Mike Rapoport, Christian Brauner,
    peterx@redhat.com, linux-fsdevel@vger.kernel.org, Andrea Arcangeli,
    Ingo Molnar, James Houghton, Nadav Amit
Subject: [PATCH 0/7] mm/userfaultfd/poll: Scale userfaultfd wakeups
Date: Tue, 5 Sep 2023 17:42:28 -0400
Message-ID: <20230905214235.320571-1-peterx@redhat.com>
X-Patchwork-Id: 13375093

Userfaultfd is the kind of file that doesn't need wake-all semantics: if
there is a message enqueued (for either a fault address or an event), we
only need to wake up one service thread to handle it. Waking up more
threads normally just wastes CPU cycles. Besides that, and more
importantly, it doesn't scale.

Andrea used to have a patch that made read() O(1), but it never hit
upstream. This is my effort to upstream it (it is a one-liner). On top
of that, I also made poll() O(1) on wakeup (more or less bringing
EPOLLEXCLUSIVE to poll()), with some tests showing the effect.

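To make the wake-all versus wake-one difference concrete, here is a
minimal userspace model of how a wakeup walks a wait queue. It is only a
sketch and not the kernel's code: struct waiter, wake_up_model() and
woken_per_message() are hypothetical names, and the model assumes that a
wakeup visits waiters in queue order and stops once nr_exclusive
exclusive waiters have been woken.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAITERS 64

/* Hypothetical, simplified stand-in for a wait queue entry. */
struct waiter {
	bool exclusive;	/* waiter asked for wake-one semantics */
	bool woken;
};

/*
 * Wake waiters in queue order. Stop as soon as nr_exclusive exclusive
 * waiters have been woken; non-exclusive waiters never stop the walk.
 * Returns how many waiters were woken in total.
 */
static size_t wake_up_model(struct waiter *q, size_t n, int nr_exclusive)
{
	size_t woken = 0;

	for (size_t i = 0; i < n; i++) {
		q[i].woken = true;
		woken++;
		if (q[i].exclusive && --nr_exclusive == 0)
			break;
	}
	return woken;
}

/* Number of idle threads woken by one message, for a given queue type. */
static size_t woken_per_message(size_t idle, bool exclusive)
{
	struct waiter q[MAX_WAITERS];

	if (idle > MAX_WAITERS)
		idle = MAX_WAITERS;
	for (size_t i = 0; i < idle; i++)
		q[i] = (struct waiter){ .exclusive = exclusive, .woken = false };

	return wake_up_model(q, idle, 1);
}
```

In this model, woken_per_message(32, false) wakes all 32 idle threads
for a single message, while woken_per_message(32, true) wakes exactly
one, which is the behaviour this series aims for on both read() and
poll().
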
To verify this, I added a test called uffd-perf (leveraging the
refactored uffd selftest suite) that measures the messaging channel
latencies on wakeups. The waitqueue optimizations are reflected in the
new test:

  Constants: 40 uffd threads, on N_CPUS=40, memsize=512M
  Units: milliseconds (to finish the test)
  diff (%): (after - before) / after * 100

  |-----------------+--------+-------+------------|
  | test case       | before | after | diff (%)   |
  |-----------------+--------+-------+------------|
  | workers=8,poll  |   1762 |  1133 | -55.516328 |
  | workers=8,read  |   1437 |   585 | -145.64103 |
  | workers=16,poll |   1117 |  1097 | -1.8231541 |
  | workers=16,read |   1159 |   759 | -52.700922 |
  | workers=32,poll |   1001 |   973 | -2.8776978 |
  | workers=32,read |    866 |   713 | -21.458626 |
  |-----------------+--------+-------+------------|

The more threads hanging on fd_wqh, the bigger the difference shown in
the numbers. "8 worker threads" is the worst case here because it means
there can be up to 40-8=32 threads hanging idle on the fd_wqh queue. In
real life there can be more workers than this, but a small number of
active worker threads will cause a similar effect.

This is currently based on Andrew's mm-unstable branch, but I assume it
is applicable to most of the not-so-old trees.

Comments welcomed, thanks.

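For anyone re-deriving the table: the diff (%) values are consistent
with the change being taken relative to the "after" time rather than the
"before" time, which is why one row exceeds -100%. A small sketch of
that arithmetic follows; diff_percent() and idle_waiters() are
hypothetical helper names for illustration, not part of uffd-perf.

```c
/*
 * Relative change in percent, taken against the "after" time:
 * (after - before) / after * 100. Negative means the test got faster.
 */
static double diff_percent(double before_ms, double after_ms)
{
	return (after_ms - before_ms) / after_ms * 100.0;
}

/* Threads left idle on the wait queue when only some of them are busy. */
static int idle_waiters(int uffd_threads, int workers)
{
	return uffd_threads - workers;
}
```

For the first row, diff_percent(1762, 1133) gives roughly -55.5, and
idle_waiters(40, 8) gives the 32 idle threads mentioned above.
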
Andrea Arcangeli (1):
  mm/userfaultfd: Make uffd read() wait event exclusive

Peter Xu (6):
  poll: Add a poll_flags for poll_queue_proc()
  poll: POLL_ENQUEUE_EXCLUSIVE
  fs/userfaultfd: Use exclusive waitqueue for poll()
  selftests/mm: Replace uffd_read_mutex with a semaphore
  selftests/mm: Create uffd_fault_thread_create|join()
  selftests/mm: uffd perf test

 drivers/vfio/virqfd.c                    |   4 +-
 drivers/vhost/vhost.c                    |   2 +-
 drivers/virt/acrn/irqfd.c                |   2 +-
 fs/aio.c                                 |   2 +-
 fs/eventpoll.c                           |   2 +-
 fs/select.c                              |   9 +-
 fs/userfaultfd.c                         |   8 +-
 include/linux/poll.h                     |  25 ++-
 io_uring/poll.c                          |   4 +-
 mm/memcontrol.c                          |   4 +-
 net/9p/trans_fd.c                        |   3 +-
 tools/testing/selftests/mm/Makefile      |   2 +
 tools/testing/selftests/mm/uffd-common.c |  65 +++++++
 tools/testing/selftests/mm/uffd-common.h |   7 +
 tools/testing/selftests/mm/uffd-perf.c   | 207 +++++++++++++++++++++++
 tools/testing/selftests/mm/uffd-stress.c |  53 +-----
 virt/kvm/eventfd.c                       |   2 +-
 17 files changed, 337 insertions(+), 64 deletions(-)
 create mode 100644 tools/testing/selftests/mm/uffd-perf.c