From patchwork Fri Apr 15 21:58:44 2022
X-Patchwork-Submitter: Oliver Upton
X-Patchwork-Id: 12815443
Date: Fri, 15 Apr 2022 21:58:44 +0000
Message-Id: <20220415215901.1737897-1-oupton@google.com>
Subject: [RFC PATCH 00/17] KVM: arm64: Parallelize stage 2 fault handling
From: Oliver Upton
To: kvmarm@lists.cs.columbia.edu
Cc: kvm@vger.kernel.org, Marc Zyngier, James Morse, Alexandru Elisei,
    Suzuki K Poulose, linux-arm-kernel@lists.infradead.org, Peter Shier,
    Ricardo Koller, Reiji Watanabe, Paolo Bonzini, Sean Christopherson,
    Ben Gardon, David Matlack, Oliver Upton

Presently KVM only takes a read lock for stage 2 faults if it believes
the fault can be fixed by relaxing permissions on a PTE (write unprotect
for dirty logging). Otherwise, stage 2 faults grab the write lock, which
predictably can pile up all the vCPUs in a sufficiently large VM.

The x86 port of KVM has what it calls the TDP MMU. Basically, it is an
MMU protected by the combination of a read-write lock and RCU, allowing
page walkers to traverse in parallel.

This series is strongly inspired by the mechanics of the TDP MMU, making
use of RCU to protect parallel walks. Note that the TLB invalidation
mechanics are a bit different between x86 and ARM, so we need to use the
'break-before-make' sequence to split a block mapping or collapse a table
mapping. Nonetheless, using atomics on the break side allows fault
handlers to acquire exclusive access to a PTE (let's just call it
'locked'). Once the PTE lock is acquired, it is safe to assume exclusive
access.

Special consideration is required when pruning the page tables in
parallel. Suppose we are collapsing a table into a block. Allowing
parallel faults means that a software walker could be in the middle of a
lower-level traversal when the table is unlinked. Table walkers that
prune the paging structures must now 'lock' all descendant PTEs,
effectively asserting exclusive ownership of the substructure (no other
walker can install something to an already-locked PTE).

Additionally, for parallel walks we need to punt the freeing of table
pages to the next RCU grace period, as there could be multiple observers
of the table until all walkers exit the RCU critical section. For this I
decided to cram an rcu_head into page private data for every table page.
We wind up spending a bit more on table pages now, but lazily allocating
rcu callbacks probably doesn't make a lot of sense. Not only would we
need a large cache of them (think about installing a level 1 block) to
wire up callbacks on all descendant tables, but we would also then need
to spend memory to actually free memory.

I tried to organize these patches as best I could without introducing
intermediate breakage. The first 5 patches are meant mostly as
preparatory reworks and, in the case of RCU, a nop.

Patch 6 is quite large, but I had a hard time deciding how to change the
way we link/unlink tables to use atomics without breaking things along
the way.
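To make the 'locked PTE' idea above concrete, here is a minimal
standalone sketch of the break side using C11 atomics. It is
illustrative only: the names (KVM_INVALID_PTE_LOCKED,
stage2_try_break_pte, stage2_make_pte) and the encoding of the locked
marker are assumptions made for the sketch, not the series' actual code,
and the real MMU code would issue the TLB invalidation where the comment
indicates.

  /*
   * Standalone sketch of the "break" side of break-before-make. All
   * identifiers below are invented for illustration.
   */
  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  typedef uint64_t kvm_pte_t;

  /* Reserved invalid encoding meaning "a walker owns this entry". */
  #define KVM_INVALID_PTE_LOCKED ((kvm_pte_t)0xF0)

  /*
   * Atomically swap the value we read earlier for the locked marker.
   * Failure means another walker raced us and changed the entry first.
   */
  static bool stage2_try_break_pte(_Atomic kvm_pte_t *ptep, kvm_pte_t old)
  {
  	if (!atomic_compare_exchange_strong(ptep, &old,
  					    KVM_INVALID_PTE_LOCKED))
  		return false;

  	/* The real code would invalidate TLBs for the old mapping here. */
  	return true;
  }

  /* Publish the new mapping, releasing exclusive ownership of the entry. */
  static void stage2_make_pte(_Atomic kvm_pte_t *ptep, kvm_pte_t new_pte)
  {
  	atomic_store_explicit(ptep, new_pte, memory_order_release);
  }

  int main(void)
  {
  	_Atomic kvm_pte_t pte = 0;	/* empty stage 2 entry */

  	if (stage2_try_break_pte(&pte, 0))
  		stage2_make_pte(&pte, 0x40000000f43ULL); /* made-up leaf */

  	printf("pte = 0x%llx\n", (unsigned long long)pte);
  	return 0;
  }

Any walker that loses the compare-and-exchange simply backs off and
retries (or bails), which is what gives the winner exclusive access
without ever holding the MMU write lock.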
Patch 7 probably should come before patch 6, as it informs the other
read-side fault path (permission relaxation) when a map is in progress
so it will back off.

Patches 8-10 take care of the pruning case, actually locking the child
PTEs instead of simply dropping table page references along the way.
Note that we cannot assume a PTE points to a table/page at this point,
hence the same helper is called for pre- and leaf-traversal; the
recursion is guided by what got yanked from the PTE.

Patches 11-14 wire up everything to schedule RCU callbacks on to-be-freed
table pages. rcu_barrier() is called on the way out from tearing down a
stage 2 page table to guarantee all memory associated with the VM has
actually been cleaned up. (A rough sketch of this deferred-free idea
follows the diffstat below.)

Patches 15-16 loop the fault handler into the new table traversal scheme.
Lastly, patch 17 is a nasty bit of debugging residue to spot possible
table page leaks. Please don't laugh ;-)

Smoke tested with KVM selftests + kvm_page_table_test with 2M hugetlb to
exercise the table pruning code. Haven't done anything beyond this;
sending as an RFC now to get eyes on the code.

Applies to commit fb649bda6f56 ("Merge tag 'block-5.18-2022-04-15' of
git://git.kernel.dk/linux-block")

Oliver Upton (17):
  KVM: arm64: Directly read owner id field in stage2_pte_is_counted()
  KVM: arm64: Only read the pte once per visit
  KVM: arm64: Return the next table from map callbacks
  KVM: arm64: Protect page table traversal with RCU
  KVM: arm64: Take an argument to indicate parallel walk
  KVM: arm64: Implement break-before-make sequence for parallel walks
  KVM: arm64: Enlighten perm relax path about parallel walks
  KVM: arm64: Spin off helper for initializing table pte
  KVM: arm64: Tear down unlinked page tables in parallel walk
  KVM: arm64: Assume a table pte is already owned in post-order traversal
  KVM: arm64: Move MMU cache init/destroy into helpers
  KVM: arm64: Stuff mmu page cache in sub struct
  KVM: arm64: Setup cache for stage2 page headers
  KVM: arm64: Punt last page reference to rcu callback for parallel walk
  KVM: arm64: Allow parallel calls to kvm_pgtable_stage2_map()
  KVM: arm64: Enable parallel stage 2 MMU faults
  TESTONLY: KVM: arm64: Add super lazy accounting of stage 2 table pages

 arch/arm64/include/asm/kvm_host.h     |   5 +-
 arch/arm64/include/asm/kvm_mmu.h      |   2 +
 arch/arm64/include/asm/kvm_pgtable.h  |  14 +-
 arch/arm64/kvm/arm.c                  |   4 +-
 arch/arm64/kvm/hyp/nvhe/mem_protect.c |  13 +-
 arch/arm64/kvm/hyp/nvhe/setup.c       |  13 +-
 arch/arm64/kvm/hyp/pgtable.c          | 518 +++++++++++++++++++-------
 arch/arm64/kvm/mmu.c                  | 120 ++++--
 8 files changed, 503 insertions(+), 186 deletions(-)
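As promised above, a rough kernel-style sketch of deferring the free of
an unlinked table page to an RCU callback, via a header hung off the
page's private field. The struct and helper names are invented for
illustration and do not match the series' identifiers (patch 13's
kmem_cache for stage 2 page headers likely shapes this differently);
only the general idea, call_rcu() on unlink and rcu_barrier() on
teardown, follows the description above.

  /*
   * Illustrative sketch only: stage2_page_header and the helpers below
   * are made-up names, not the series' actual code.
   */
  #include <linux/gfp.h>
  #include <linux/mm.h>
  #include <linux/rcupdate.h>
  #include <linux/slab.h>

  struct stage2_page_header {
  	struct rcu_head rcu_head;
  	struct page *page;
  };

  /* Pair every table page with a header at allocation time. */
  static void *stage2_alloc_table_page(void)
  {
  	struct page *page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
  	struct stage2_page_header *hdr;

  	if (!page)
  		return NULL;

  	hdr = kzalloc(sizeof(*hdr), GFP_KERNEL_ACCOUNT);
  	if (!hdr) {
  		__free_page(page);
  		return NULL;
  	}

  	hdr->page = page;
  	set_page_private(page, (unsigned long)hdr);
  	return page_address(page);
  }

  static void stage2_free_table_page_rcu(struct rcu_head *head)
  {
  	struct stage2_page_header *hdr =
  		container_of(head, struct stage2_page_header, rcu_head);

  	__free_page(hdr->page);
  	kfree(hdr);
  }

  /*
   * Called when a walker unlinks a table: other walkers inside their RCU
   * read-side critical sections may still be traversing it, so defer the
   * actual free until a grace period has elapsed.
   */
  static void stage2_defer_free_table(void *table)
  {
  	struct page *page = virt_to_page(table);
  	struct stage2_page_header *hdr =
  		(struct stage2_page_header *)page_private(page);

  	call_rcu(&hdr->rcu_head, stage2_free_table_page_rcu);
  }

On teardown, rcu_barrier() then waits for any callbacks still in flight,
matching the guarantee described for patches 11-14 above.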