From patchwork Wed Nov 6 15:58:49 2019
From: Dario Faggioli
To: xen-devel@lists.xenproject.org
Cc: George.Dunlap@eu.citrix.com, julien.grall@arm.com, sstabellini@kernel.org, jgross@suse.de
Date: Wed, 06 Nov 2019 16:58:49 +0100
Message-ID: <157305592941.20672.10855835711915878073.stgit@Palanthas>
User-Agent: StGit/0.19
Subject: [Xen-devel] [BUGFIX PATCH for-4.13] sched: fix dom0less boot with the null scheduler

In a dom0less configuration, if the null scheduler is used, the system
may fail to boot, because the loop in null_unit_wake() never exits.

Bisection showed that this behavior occurs since commit d545f1d6
("xen: sched: deal with vCPUs being or becoming online or offline"),
but the real problem is that, in this case, pick_res() always returns
the same CPU.

Fix this by dealing, in null_unit_wake(), only with the simple case,
i.e., when the vCPU that is coming online can be assigned to a sched.
resource right away. If it can't, just add it to the waitqueue; it will
be dealt with in null_schedule(), taking care not to race with
vcpu_wake().

Reported-by: Stefano Stabellini
Signed-off-by: Dario Faggioli
Tested-by: Stefano Stabellini
Reviewed-by: George Dunlap
---
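Just to illustrate the idea for reviewers, here is a minimal,
standalone C model of the new wake path (not Xen code: cpus_free,
hard_affinity, unit_wake() and tickle() are invented stand-ins for the
sched_null internals). A waking unit first tries the simple case, i.e.,
its previous CPU being free and within its hard affinity; otherwise it
goes to the waitqueue (elided here) and every matching free CPU is
tickled, so the first one that reschedules picks it up:

/* Build with: gcc -Wall wakepath.c -o wakepath */
#include <stdio.h>
#include <stdint.h>

#define NR_CPUS 8

static uint32_t cpus_free;            /* bit i set => CPU i is free */

struct unit {
    int prev_cpu;                     /* resource used last time */
    uint32_t hard_affinity;           /* CPUs the unit may run on */
};

static void tickle(int cpu)
{
    printf("raise SCHEDULE_SOFTIRQ on CPU %d\n", cpu);
}

/* Returns the CPU the unit was assigned to, or -1 if it was enqueued. */
static int unit_wake(struct unit *u)
{
    uint32_t mask;
    int cpu;

    /* Simple case: previous CPU is free and in the hard affinity mask. */
    if ( (cpus_free & u->hard_affinity) & (1u << u->prev_cpu) )
    {
        cpus_free &= ~(1u << u->prev_cpu);    /* i.e., unit_assign() */
        tickle(u->prev_cpu);
        return u->prev_cpu;
    }

    /*
     * Otherwise, enqueue and tickle all the free CPUs the unit can run
     * on; the first one that goes through the scheduler picks it up.
     */
    mask = cpus_free & u->hard_affinity;
    if ( mask == 0 )
        printf("WARNING: unit not assigned to any CPU!\n");
    for ( cpu = 0; cpu < NR_CPUS; cpu++ )
        if ( mask & (1u << cpu) )
            tickle(cpu);
    return -1;
}

int main(void)
{
    struct unit u = { .prev_cpu = 2, .hard_affinity = 0x0c }; /* CPUs 2,3 */

    cpus_free = 0x08;     /* only CPU 3 free: slow path, tickle CPU 3 */
    unit_wake(&u);
    cpus_free = 0x0c;     /* CPU 2 free again: fast path, take CPU 2  */
    unit_wake(&u);
    return 0;
}

Note how the slow path only tickles CPUs; with the old code, the wake
path itself tried to grab a CPU, and could spin forever when pick_res()
kept returning the same, already-taken one.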
 xen/common/sched_null.c |  113 +++++++++++++++++++++++++++++++----------------
 1 file changed, 75 insertions(+), 38 deletions(-)

diff --git a/xen/common/sched_null.c b/xen/common/sched_null.c
index 2525464a7c..88bd11a187 100644
--- a/xen/common/sched_null.c
+++ b/xen/common/sched_null.c
@@ -568,50 +568,52 @@ static void null_unit_wake(const struct scheduler *ops,
     else
         SCHED_STAT_CRANK(unit_wake_not_runnable);
 
+    if ( likely(per_cpu(npc, cpu).unit == unit) )
+    {
+        cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
+        return;
+    }
+
     /*
      * If a unit is neither on a pCPU nor in the waitqueue, it means it was
-     * offline, and that it is now coming back being online.
+     * offline, and that it is now coming back being online. If we're lucky,
+     * and its previous resource is free (and affinities match), we can just
+     * assign the unit to it (we own the proper lock already) and be done.
      */
-    if ( unlikely(per_cpu(npc, cpu).unit != unit && list_empty(&nvc->waitq_elem)) )
+    if ( per_cpu(npc, cpu).unit == NULL &&
+         unit_check_affinity(unit, cpu, BALANCE_HARD_AFFINITY) )
     {
-        spin_lock(&prv->waitq_lock);
-        list_add_tail(&nvc->waitq_elem, &prv->waitq);
-        spin_unlock(&prv->waitq_lock);
-
-        cpumask_and(cpumask_scratch_cpu(cpu), unit->cpu_hard_affinity,
-                    cpupool_domain_master_cpumask(unit->domain));
-
-        if ( !cpumask_intersects(&prv->cpus_free, cpumask_scratch_cpu(cpu)) )
+        if ( !has_soft_affinity(unit) ||
+             unit_check_affinity(unit, cpu, BALANCE_SOFT_AFFINITY) )
         {
-            dprintk(XENLOG_G_WARNING, "WARNING: d%dv%d not assigned to any CPU!\n",
-                    unit->domain->domain_id, unit->unit_id);
+            unit_assign(prv, unit, cpu);
+            cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
             return;
         }
+    }
 
-        /*
-         * Now we would want to assign the unit to cpu, but we can't, because
-         * we don't have the lock. So, let's do the following:
-         *  - try to remove cpu from the list of free cpus, to avoid races with
-         *    other onlining, inserting or migrating operations;
-         *  - tickle the cpu, which will pickup work from the waitqueue, and
-         *    assign it to itself;
-         *  - if we're racing already, and if there still are free cpus, try
-         *    again.
-         */
-        while ( cpumask_intersects(&prv->cpus_free, cpumask_scratch_cpu(cpu)) )
-        {
-            unsigned int new_cpu = pick_res(prv, unit)->master_cpu;
+    /*
+     * If the resource is not free (or affinities do not match) we need
+     * to assign unit to some other one, but we can't do it here, as:
+     *  - we don't own the proper lock,
+     *  - we can't change v->processor under vcpu_wake()'s feet.
+     * So we add it to the waitqueue, and tickle all the free CPUs (if any)
+     * on which unit can run. The first one that schedules will pick it up.
+     */
+    spin_lock(&prv->waitq_lock);
+    list_add_tail(&nvc->waitq_elem, &prv->waitq);
+    spin_unlock(&prv->waitq_lock);
 
-            if ( test_and_clear_bit(new_cpu, &prv->cpus_free) )
-            {
-                cpu_raise_softirq(new_cpu, SCHEDULE_SOFTIRQ);
-                return;
-            }
-        }
-    }
+    cpumask_and(cpumask_scratch_cpu(cpu), unit->cpu_hard_affinity,
+                cpupool_domain_master_cpumask(unit->domain));
+    cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
+                &prv->cpus_free);
 
-    /* Note that we get here only for units assigned to a pCPU */
-    cpu_raise_softirq(sched_unit_master(unit), SCHEDULE_SOFTIRQ);
+    if ( cpumask_empty(cpumask_scratch_cpu(cpu)) )
+        dprintk(XENLOG_G_WARNING, "WARNING: d%dv%d not assigned to any CPU!\n",
+                unit->domain->domain_id, unit->unit_id);
+    else
+        cpumask_raise_softirq(cpumask_scratch_cpu(cpu), SCHEDULE_SOFTIRQ);
 }
 
 static void null_unit_sleep(const struct scheduler *ops,
@@ -827,6 +829,8 @@ static void null_schedule(const struct scheduler *ops, struct sched_unit *prev,
      */
     if ( unlikely(prev->next_task == NULL) )
     {
+        bool unit_found;
+
         spin_lock(&prv->waitq_lock);
 
         if ( list_empty(&prv->waitq) )
@@ -839,6 +843,7 @@ static void null_schedule(const struct scheduler *ops, struct sched_unit *prev,
          * it only in cases where a pcpu has no unit associated (e.g., as
          * said above, the cpu has just joined a cpupool).
          */
+        unit_found = false;
         for_each_affinity_balance_step( bs )
         {
             list_for_each_entry( wvc, &prv->waitq, waitq_elem )
@@ -849,13 +854,45 @@ static void null_schedule(const struct scheduler *ops, struct sched_unit *prev,
 
                 if ( unit_check_affinity(wvc->unit, sched_cpu, bs) )
                 {
-                    unit_assign(prv, wvc->unit, sched_cpu);
-                    list_del_init(&wvc->waitq_elem);
-                    prev->next_task = wvc->unit;
-                    goto unlock;
+                    spinlock_t *lock;
+
+                    unit_found = true;
+
+                    /*
+                     * If the unit in the waitqueue has just come up online,
+                     * we risk racing with vcpu_wake(). To avoid this, sync
+                     * on the spinlock that vcpu_wake() holds, but only with
+                     * trylock, to avoid deadlock.
+                     */
+                    lock = pcpu_schedule_trylock(sched_unit_master(wvc->unit));
+
+                    /*
+                     * We know the vcpu's lock is not this resource's lock. In
+                     * fact, if it were, since this cpu is free, vcpu_wake()
+                     * would have assigned the unit here directly.
+                     */
+                    ASSERT(lock != get_sched_res(sched_cpu)->schedule_lock);
+
+                    if ( lock ) {
+                        unit_assign(prv, wvc->unit, sched_cpu);
+                        list_del_init(&wvc->waitq_elem);
+                        prev->next_task = wvc->unit;
+                        spin_unlock(lock);
+                        goto unlock;
+                    }
                 }
             }
         }
 
+        /*
+         * If we did find a unit with suitable affinity in the waitqueue, but
+         * we could not pick it up (due to lock contention), and hence we are
+         * still free, plan for another try. In fact, we don't want such a
+         * unit to be stuck in the waitqueue when there are free cpus where
+         * it could run.
+         */
+        if ( unlikely( unit_found && prev->next_task == NULL &&
+                       !list_empty(&prv->waitq)) )
+            cpu_raise_softirq(cur_cpu, SCHEDULE_SOFTIRQ);
 
  unlock:
         spin_unlock(&prv->waitq_lock);
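As a side note for reviewers: the trylock-based synchronization used in
null_schedule() above can be modeled outside of Xen as below (a minimal
pthread sketch; unit_lock and try_pick_from_waitq() are invented names,
not Xen APIs). The scheduler side never blocks on the lock that the
wake side may hold; on contention it backs off and relies on being
tickled again:

/* Build with: gcc -Wall trylock.c -o trylock -lpthread */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t unit_lock = PTHREAD_MUTEX_INITIALIZER;

/* Scheduler side: take the waking unit's lock only opportunistically. */
static bool try_pick_from_waitq(void)
{
    if ( pthread_mutex_trylock(&unit_lock) != 0 )
        return false;          /* contended: give up, retry via softirq */

    /* ... unit_assign() + list_del_init() would happen here ... */
    pthread_mutex_unlock(&unit_lock);
    return true;
}

int main(void)
{
    /* Uncontended: the pick succeeds on the first attempt. */
    printf("picked: %s\n", try_pick_from_waitq() ? "yes" : "no");

    /* Contended (as if vcpu_wake() held the lock): trylock backs off. */
    pthread_mutex_lock(&unit_lock);
    printf("picked: %s\n", try_pick_from_waitq() ? "yes" : "no");
    pthread_mutex_unlock(&unit_lock);
    return 0;
}

This also mirrors why the ASSERT in the hunk above holds: if the waking
side owned this resource's lock, it would have assigned the unit to the
free CPU directly, and the scheduler side would never find it in the
waitqueue.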