From patchwork Thu Jun 8 09:30:24 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Wanpeng Li
X-Patchwork-Id: 9774283
From: Wanpeng Li
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: Paolo Bonzini, Radim Krčmář, Wanpeng Li
Subject: [PATCH RFC] KVM: async_pf: fix async_pf exception injection
Date: Thu, 8 Jun 2017 02:30:24 -0700
Message-Id: <1496914224-87730-1-git-send-email-wanpeng.li@hotmail.com>
X-Mailer: git-send-email 2.7.4

INFO: task gnome-terminal-:1734 blocked for more than 120 seconds.
      Not tainted 4.12.0-rc4+ #8
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gnome-terminal- D    0  1734   1015 0x00000000
Call Trace:
 __schedule+0x3cd/0xb30
 schedule+0x40/0x90
 kvm_async_pf_task_wait+0x1cc/0x270
 ? __vfs_read+0x37/0x150
 ? prepare_to_swait+0x22/0x70
 do_async_page_fault+0x77/0xb0
 ? do_async_page_fault+0x77/0xb0
 async_page_fault+0x28/0x30

This hang is triggered by running win7 and win2016 guests on L1 KVM
simultaneously and then putting memory pressure on L1; I can observe
it on L1 once at least ~70% of the swap area on L0 is occupied. It
happens because an async_pf that should be injected into L1 is
injected into L2 instead: the L2 guest starts receiving page faults
with a bogus %cr2 (actually the apf token from the host), and the L1
guest starts accumulating tasks stuck in D state in
kvm_async_pf_task_wait().

I tried to fix it according to Radim's proposal: "force a nested VM
exit from nested_vmx_check_exception if the injected #PF is async_pf
and handle the #PF VM exit in L1".

https://www.spinics.net/lists/kvm/msg142498.html

However, I found that "nr == PF_VECTOR && vmx->apf_reason != 0" is
never true in nested_vmx_check_exception(). SVM depends on a similar
check in nested_svm_intercept(), which leaves me confused about how it
can work. In addition, vmx/svm->apf_reason should be read in L1, since
apf_reason.reason is only meaningful in a PV guest; on L0,
vmx/svm->apf_reason should therefore always be 0. I changed the
condition to "nr == PF_VECTOR && error_code == 0" to intercept
async_pf; however, the bug below is then splatted:

BUG: unable to handle kernel paging request at ffffe305770a87e0
IP: kfree+0x6f/0x300
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 3 PID: 2187 Comm: transhuge-stres Tainted: G           OE   4.12.0-rc4+ #9
task: ffff8a9214b58000 task.stack: ffffb46bc34e4000
RIP: 0010:kfree+0x6f/0x300
RSP: 0000:ffffb46bc34e7b28 EFLAGS: 00010086
RAX: ffffe305770a87c0 RBX: ffffb46bc2a1fe70 RCX: 0000000000000001
RDX: 0000757180000000 RSI: 00000000ffffffff RDI: 0000000000000096
RBP: ffffb46bc34e7b50 R08: 0000000000000000 R09: 0000000000000001
R10: ffffb46bc34e7ac8 R11: 68b9962a00000000 R12: 000000a7770a87c0
R13: ffffffff90059b75 R14: ffffffff913466c0 R15: ffffe25e06f18000
FS:  00007f1904ae7700(0000) GS:ffff8a921a800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffe305770a87e0 CR3: 000000040eb5c000 CR4: 00000000001426e0
Call Trace:
 kvm_async_pf_task_wait+0xd5/0x280
 ? __this_cpu_preempt_check+0x13/0x20
 do_async_page_fault+0x77/0xb0
 ? do_async_page_fault+0x77/0xb0
 async_page_fault+0x28/0x30

In addition, if svm->apf_reason doesn't make sense on L0, then maybe
it also will not work in nested_svm_exit_special(). The patch below is
incomplete; any input to improve it would be greatly appreciated.
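For reference, here is the guest-side handler that consumes these
tokens (quoted from memory of the 4.12-era arch/x86/kernel/kvm.c, so
details may differ). A PAGE_NOT_PRESENT wait is only released by a
later PAGE_READY wake carrying the same token, which is why a
misdirected injection leaves tasks stuck in D state:

dotraplinkage void
do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
{
	enum ctx_state prev_state;

	switch (kvm_read_and_reset_pf_reason()) {
	default:
		/* not an async_pf notification: handle a real #PF */
		trace_do_page_fault(regs, error_code);
		break;
	case KVM_PV_REASON_PAGE_NOT_PRESENT:
		/* page is swapped out by the host: wait on the token in %cr2 */
		prev_state = exception_enter();
		kvm_async_pf_task_wait((u32)read_cr2());
		exception_exit(prev_state);
		break;
	case KVM_PV_REASON_PAGE_READY:
		/* page is available again: wake the waiter for this token */
		rcu_irq_enter();
		kvm_async_pf_task_wake((u32)read_cr2());
		rcu_irq_exit();
		break;
	}
}

Note that kvm_read_and_reset_pf_reason() reads the per-CPU apf_reason
area the guest registered via MSR_KVM_ASYNC_PF_EN, which is why
apf_reason.reason is only meaningful inside a PV guest.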
Cc: Paolo Bonzini
Cc: Radim Krčmář
Signed-off-by: Wanpeng Li
---
 arch/x86/kvm/vmx.c | 40 +++++++++++++++++++++++++++++++---------
 1 file changed, 31 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ca5d2b9..21a1b44 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -616,6 +616,7 @@ struct vcpu_vmx {
 	bool emulation_required;
 
 	u32 exit_reason;
+	u32 apf_reason;
 
 	/* Posted interrupt descriptor */
 	struct pi_desc pi_desc;
@@ -2418,11 +2419,12 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
  * KVM wants to inject page-faults which it got to the guest. This function
  * checks whether in a nested guest, we need to inject them to L1 or L2.
  */
-static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned nr)
+static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code)
 {
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 
-	if (!(vmcs12->exception_bitmap & (1u << nr)))
+	if (!((vmcs12->exception_bitmap & (1u << nr)) ||
+	    (nr == PF_VECTOR && error_code == 0)))
 		return 0;
 
 	nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
@@ -2439,7 +2441,7 @@ static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
 	u32 intr_info = nr | INTR_INFO_VALID_MASK;
 
 	if (!reinject && is_guest_mode(vcpu) &&
-	    nested_vmx_check_exception(vcpu, nr))
+	    nested_vmx_check_exception(vcpu, nr, error_code))
 		return;
 
 	if (has_error_code) {
@@ -5646,14 +5648,30 @@ static int handle_exception(struct kvm_vcpu *vcpu)
 	}
 
 	if (is_page_fault(intr_info)) {
-		/* EPT won't cause page fault directly */
-		BUG_ON(enable_ept);
 		cr2 = vmcs_readl(EXIT_QUALIFICATION);
-		trace_kvm_page_fault(cr2, error_code);
+		switch (vmx->apf_reason) {
+		default:
+			/* EPT won't cause page fault directly */
+			BUG_ON(enable_ept);
+			trace_kvm_page_fault(cr2, error_code);
 
-		if (kvm_event_needs_reinjection(vcpu))
-			kvm_mmu_unprotect_page_virt(vcpu, cr2);
-		return kvm_mmu_page_fault(vcpu, cr2, error_code, NULL, 0);
+			if (kvm_event_needs_reinjection(vcpu))
+				kvm_mmu_unprotect_page_virt(vcpu, cr2);
+			return kvm_mmu_page_fault(vcpu, cr2, error_code, NULL, 0);
+		case KVM_PV_REASON_PAGE_NOT_PRESENT:
+			vmx->apf_reason = 0;
+			local_irq_disable();
+			kvm_async_pf_task_wait(cr2);
+			local_irq_enable();
+			break;
+		case KVM_PV_REASON_PAGE_READY:
+			vmx->apf_reason = 0;
+			local_irq_disable();
+			kvm_async_pf_task_wake(cr2);
+			local_irq_enable();
+			break;
+		}
+		return 0;
 	}
 
 	ex_no = intr_info & INTR_INFO_VECTOR_MASK;
@@ -8600,6 +8619,10 @@ static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
 	vmx->exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
 	exit_intr_info = vmx->exit_intr_info;
 
+	/* if exit due to PF check for async PF */
+	if (is_page_fault(exit_intr_info))
+		vmx->apf_reason = kvm_read_and_reset_pf_reason();
+
 	/* Handle machine checks before interrupts are enabled */
 	if (is_machine_check(exit_intr_info))
 		kvm_machine_check();
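
The VMX switch above mirrors what SVM already does. For comparison,
here is a condensed sketch of the 4.12-era pf_interception() from
arch/x86/kvm/svm.c, quoted from memory (the decode-assist arguments to
kvm_mmu_page_fault() are elided); svm->apf_reason is filled in from
kvm_read_and_reset_pf_reason() right after vmexit in svm_vcpu_run(),
just as the vmx_complete_atomic_exit() hunk above does for VMX:

static int pf_interception(struct vcpu_svm *svm)
{
	u64 fault_address = svm->vmcb->control.exit_info_2;
	u64 error_code;
	int r = 1;

	switch (svm->apf_reason) {
	default:
		/* ordinary #PF vmexit: feed it to the MMU */
		error_code = svm->vmcb->control.exit_info_1;

		trace_kvm_page_fault(fault_address, error_code);
		if (!npt_enabled && kvm_event_needs_reinjection(&svm->vcpu))
			kvm_mmu_unprotect_page_virt(&svm->vcpu, fault_address);
		r = kvm_mmu_page_fault(&svm->vcpu, fault_address,
				       error_code, NULL, 0);
		break;
	case KVM_PV_REASON_PAGE_NOT_PRESENT:
		/* async_pf from the host above us: the page is swapped out */
		svm->apf_reason = 0;
		local_irq_disable();
		kvm_async_pf_task_wait(fault_address);
		local_irq_enable();
		break;
	case KVM_PV_REASON_PAGE_READY:
		/* async_pf from the host above us: the page is ready again */
		svm->apf_reason = 0;
		local_irq_disable();
		kvm_async_pf_task_wake(fault_address);
		local_irq_enable();
		break;
	}
	return r;
}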