From patchwork Tue Nov 7 08:07:04 2017
X-Patchwork-Submitter: Jan Beulich
X-Patchwork-Id: 10045833
Message-Id: <5A0177B8020000780018CCC9@prv-mh.provo.novell.com>
Date: Tue, 07 Nov 2017 01:07:04 -0700
From: "Jan Beulich"
To: "Igor Druzhinin"
Cc: Sergey Dyasli, Kevin Tian, Andrew Cooper, Anshul Makkar,
 raistlin@linux.it, Jun Nakajima, xen-devel
Subject: Re: [Xen-devel] [PATCH v2 1/2] VMX: fix VMCS race on context-switch paths

>>> On 02.11.17 at 20:46, wrote:
>> Any ideas about the root cause of the fault and suggestions on how to
>> reproduce it would be welcome. Does this crash really have something to
>> do with PML? I doubt it, because the original environment can hardly be
>> called PML-heavy.

Well, PML-heaviness doesn't matter. What matters is the mere fact that
PML is enabled on the vCPU being destroyed.

> So we finally have a complete understanding of what's going on:
>
> Some vCPU has just migrated to another pCPU and we switched to idle, but
> per_cpu(curr_vcpu) on the current pCPU is still pointing to it - this is
> how the current logic works. While we're in idle we issue vcpu_destroy()
> for some other domain, which eventually calls vmx_vcpu_disable_pml() and
> trashes the VMCS pointer on the current pCPU. At this moment we get a
> TLB flush IPI from that same vCPU, which is now context switching in on
> another pCPU - it is cleaning the TLB after itself. This vCPU has
> already been marked is_running=1 by the scheduler. In the IPI handler we
> enter __sync_local_execstate() and try to call vmx_ctxt_switch_from()
> for the migrated vCPU, which is supposed to call vmcs_reload() but
> doesn't, because is_running==1. The next VMWRITE crashes the hypervisor.
>
> So the state transition diagram might look like:
> pCPU1: vCPUx -> migrate to pCPU2 -> idle -> RCU callbacks ->

I'm not really clear about who/what is "idle" here: pCPU1, pCPU2, or
something else entirely? If vCPUx migrated to pCPU2, wouldn't it be put
back into a runnable state right away, meaning pCPU2 can't be idle at
this point? For pCPU1, on the other hand, I don't think its idleness
matters much, i.e. the situation could also arise without it becoming
idle as far as I can see - pCPU1 making it to any point where softirqs
are processed would suffice.

> vcpu_destroy() -> vmx_vcpu_disable_pml() -> vmcs_clear()
> pCPU2: context switch into vCPUx -> is_running = 1 -> TLB flush
> pCPU1: IPI handler -> context switch out of vCPUx -> VMWRITE -> CRASH!
>
> We could basically just fix the condition around the vmcs_reload() call,
> but I'm not completely sure that's the right way to do it - I don't
> think leaving per_cpu(curr_vcpu) pointing to a migrated vCPU is a good
> idea (maybe we need to clear it). What are your thoughts?

per_cpu(curr_vcpu) can only validly be written inside __context_switch(),
hence the only way to achieve this would be to force __context_switch() to
be called earlier, rather than out of the TLB flush IPI handler - perhaps
like in the (untested!) patch below. Two questions then remain:
- Should we perhaps rather do this in an arch-independent way (i.e. ahead
  of the call to vcpu_destroy() in common code)?
- This deals with only a special case of the more general "TLB flush
  behind the back of a vmx_vmcs_enter() / vmx_vmcs_exit() section"
  problem - does this need dealing with in a more general way? Here I'm
  thinking of introducing a FLUSH_STATE flag to be passed to flush_mask()
  instead of the current flush_tlb_mask() in context_switch() and
  sync_vcpu_execstate() (a rough, hypothetical sketch of this idea is
  appended after the patch). This could at the same time be used for a
  small performance optimization: at least for HAP vCPUs I don't think we
  really need the TLB part of the flushes here.

Jan

--- unstable.orig/xen/arch/x86/domain.c
+++ unstable/xen/arch/x86/domain.c
@@ -379,6 +379,14 @@ int vcpu_initialise(struct vcpu *v)
 
 void vcpu_destroy(struct vcpu *v)
 {
+    /*
+     * Flush all state for this vCPU before fully tearing it down. This is
+     * particularly important for HVM ones on VMX, so that this flushing of
+     * state won't happen from the TLB flush IPI handler behind the back of
+     * a vmx_vmcs_enter() / vmx_vmcs_exit() section.
+     */
+    sync_vcpu_execstate(v);
+
     xfree(v->arch.vm_event);
     v->arch.vm_event = NULL;
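
For illustration, the interleaving described above can be condensed into
the small stand-alone model below. This is not Xen code: struct vcpu,
model_ctxt_switch_from() and the two flags are simplified stand-ins, and
only the is_running check that gates the VMCS reload is modelled.

/*
 * Stand-alone model (not Xen code) of the race described above.  The
 * names mirror the Xen ones, but all types and bodies are simplified
 * stand-ins; only the is_running check gating the VMCS reload is kept.
 */
#include <stdbool.h>
#include <stdio.h>

struct vcpu {
    bool is_running;        /* set by the scheduler on the new pCPU   */
    bool vmcs_loaded_here;  /* does the old pCPU still hold the VMCS? */
};

/* Simplified switch-from path, as run from the TLB flush IPI handler. */
static void model_ctxt_switch_from(struct vcpu *v)
{
    if ( !v->is_running )
        v->vmcs_loaded_here = true;          /* vmcs_reload() */

    /* The real path continues with VMWRITEs needing a loaded VMCS. */
    if ( !v->vmcs_loaded_here )
        printf("VMWRITE with no current VMCS -> hypervisor crash\n");
}

int main(void)
{
    /* vCPUx just migrated away from pCPU1, which is now lazily idle. */
    struct vcpu x = { .is_running = false, .vmcs_loaded_here = true };

    /* pCPU1: RCU callback -> vcpu_destroy() of another domain ->
     * vmx_vcpu_disable_pml() -> vmcs_clear() trashes the loaded VMCS. */
    x.vmcs_loaded_here = false;

    /* pCPU2: scheduler context switches into vCPUx. */
    x.is_running = true;

    /* pCPU1: vCPUx's TLB flush IPI -> __sync_local_execstate() ->
     * context switch out of vCPUx: the reload is skipped because
     * is_running==1, and the subsequent VMWRITE faults. */
    model_ctxt_switch_from(&x);

    return 0;
}

Compiling and running this prints the VMWRITE line, i.e. the reload is
skipped exactly because pCPU2 has already set is_running.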
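
As to the FLUSH_STATE question, a very rough sketch of the shape of that
idea follows. FLUSH_STATE is purely hypothetical (no such flag exists
today), and flush_mask(), cpumask_t and sync_local_execstate() (standing
in for __sync_local_execstate()) are reduced to local stubs so the snippet
is self-contained - the real flush machinery IPIs the CPUs in the mask
rather than running the handler locally.

/*
 * Hypothetical sketch of the FLUSH_STATE idea.  FLUSH_STATE is not an
 * existing flag, and everything below is a simplified local stand-in
 * (the real implementation IPIs the CPUs in the mask).
 */
#include <stdio.h>

#define FLUSH_TLB    (1u << 0)   /* stand-in for the existing TLB flag */
#define FLUSH_STATE  (1u << 8)   /* proposed new flag (hypothetical)   */

typedef unsigned long cpumask_t; /* stand-in for the real cpumask type */

static void sync_local_execstate(void)
{
    /* would perform the deferred __context_switch() on the target pCPU */
    printf("lazy vCPU state synced\n");
}

static void local_flush_tlb(void)
{
    printf("TLB flushed\n");
}

/* Stand-in for the flag-driven flush handler on each targeted pCPU. */
static void flush_mask(const cpumask_t *mask, unsigned int flags)
{
    (void)mask;
    if ( flags & FLUSH_STATE )
        sync_local_execstate();   /* sync state before touching any VMCS */
    if ( flags & FLUSH_TLB )
        local_flush_tlb();
}

int main(void)
{
    cpumask_t dirty = 1;

    /* context_switch() / sync_vcpu_execstate() might then request both: */
    flush_mask(&dirty, FLUSH_TLB | FLUSH_STATE);

    /* ...while for HAP vCPUs the TLB part could perhaps be dropped: */
    flush_mask(&dirty, FLUSH_STATE);

    return 0;
}

The point of the flag is simply to make the "sync lazy vCPU state" part of
the remote flush explicit, so the TLB part can be requested or omitted
independently.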