From patchwork Thu Mar 12 13:44:14 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Dario Faggioli X-Patchwork-Id: 11434309 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6E495913 for ; Thu, 12 Mar 2020 13:45:33 +0000 (UTC) Received: from lists.xenproject.org (lists.xenproject.org [192.237.175.120]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 51A782067C for ; Thu, 12 Mar 2020 13:45:33 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 51A782067C Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=suse.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=xen-devel-bounces@lists.xenproject.org Received: from localhost ([127.0.0.1] helo=lists.xenproject.org) by lists.xenproject.org with esmtp (Exim 4.89) (envelope-from ) id 1jCO8I-0002Sz-7H; Thu, 12 Mar 2020 13:44:18 +0000 Received: from us1-rack-iad1.inumbo.com ([172.99.69.81]) by lists.xenproject.org with esmtp (Exim 4.89) (envelope-from ) id 1jCO8H-0002Sr-6L for xen-devel@lists.xenproject.org; Thu, 12 Mar 2020 13:44:17 +0000 X-Inumbo-ID: 90c17b84-6467-11ea-bec1-bc764e2007e4 Received: from mail-wm1-f65.google.com (unknown [209.85.128.65]) by us1-rack-iad1.inumbo.com (Halon) with ESMTPS id 90c17b84-6467-11ea-bec1-bc764e2007e4; Thu, 12 Mar 2020 13:44:16 +0000 (UTC) Received: by mail-wm1-f65.google.com with SMTP id e26so6353329wme.5 for ; Thu, 12 Mar 2020 06:44:16 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:from:to:cc:date:message-id:in-reply-to :references:user-agent:mime-version:content-transfer-encoding; bh=98rImuk8IgJ1Sk8C9NZCFORwdYVoL50Apj68nHt8iOs=; b=gbfGvL90gexOJw1vxigawF0PoGDzt7oJ3RWt9TovQVgJ/JXAhjg6aaYD7GftauD6ow ON+jC4as1D9kx+I+IKuZqW/TZUEoLw6ppBeWYjyzthGbElcbKzYmlbkxhe2qP9fsYJM0 QTGQiZcycy+6htYNHds4PR2rwNxz7zcOpNvw+xsGUaCGvgPOHA2ZRz4tM2pvsIkgMrSQ qtpvOYA4TA5qCHLp16svU07dW+L4rxenGpLwFekfyfrUqYLPrOLyLyh2kteG880PFdF0 1U7ajVFzONMHWngdLc96EDj4Zq9eZeQq6vjpOFN6ltFqJmf9I0ZItWMHGR5MsD59aGLK E2MQ== X-Gm-Message-State: ANhLgQ35FkB3CUimMoVSyHsmCeI0l8D/TKM3gxIrMyt9ly21CgXJatwi tRf5dT9WuPTeICNhCik8X4AWjJ3NQl4= X-Google-Smtp-Source: ADFU+vuw0C1Ft5UK6/Rw6nUF9pEZrM6dQcfnjCcOOEZJOV2XMB+TEKCabEoti3XVGNLtP12LdlMx6g== X-Received: by 2002:a1c:b686:: with SMTP id g128mr166104wmf.75.1584020655642; Thu, 12 Mar 2020 06:44:15 -0700 (PDT) Received: from [192.168.0.36] (87.78.186.89.cust.ip.kpnqwest.it. [89.186.78.87]) by smtp.gmail.com with ESMTPSA id 7sm1086475wmf.20.2020.03.12.06.44.14 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 12 Mar 2020 06:44:15 -0700 (PDT) From: Dario Faggioli To: xen-devel@lists.xenproject.org Date: Thu, 12 Mar 2020 14:44:14 +0100 Message-ID: <158402065414.753.15785539969715690913.stgit@Palanthas> In-Reply-To: <158402056376.753.7091379488590272336.stgit@Palanthas> References: <158402056376.753.7091379488590272336.stgit@Palanthas> User-Agent: StGit/0.21 MIME-Version: 1.0 Subject: [Xen-devel] [PATCH 1/2] xen: credit2: avoid vCPUs to ever reach lower credits than idle X-BeenThere: xen-devel@lists.xenproject.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Xen developer discussion List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Cc: Juergen Gross , Charles Arnold , Tomas Mozes , Glen , George Dunlap , Jan Beulich , Sarah Newman Errors-To: xen-devel-bounces@lists.xenproject.org Sender: "Xen-devel" There have been report of stalls of guest vCPUs, when Credit2 was used. It seemed like these vCPUs were not getting scheduled for very long time, even under light load conditions (e.g., during dom0 boot). Investigations led to the discovery that --although rarely-- it can happen that a vCPU manages to run for very long timeslices. In Credit2, this means that, when runtime accounting happens, the vCPU will lose a large quantity of credits. This in turn may lead to the vCPU having less credits than the idle vCPUs (-2^30). At this point, the scheduler will pick the idle vCPU, instead of the ready to run vCPU, for a few "epochs", which often times is enough for the guest kernel to think the vCPU is not responding and crashing. An example of this situation is shown here. In fact, we can see d0v1 sitting in the runqueue while all the CPUs are idle, as it has -1254238270 credits, which is smaller than -2^30 = −1073741824: (XEN) Runqueue 0: (XEN) ncpus = 28 (XEN) cpus = 0-27 (XEN) max_weight = 256 (XEN) pick_bias = 22 (XEN) instload = 1 (XEN) aveload = 293391 (~111%) (XEN) idlers: 00,00000000,00000000,00000000,00000000,00000000,0fffffff (XEN) tickled: 00,00000000,00000000,00000000,00000000,00000000,00000000 (XEN) fully idle cores: 00,00000000,00000000,00000000,00000000,00000000,0fffffff [...] (XEN) Runqueue 0: (XEN) CPU[00] runq=0, sibling=00,..., core=00,... (XEN) CPU[01] runq=0, sibling=00,..., core=00,... [...] (XEN) CPU[26] runq=0, sibling=00,..., core=00,... (XEN) CPU[27] runq=0, sibling=00,..., core=00,... (XEN) RUNQ: (XEN) 0: [0.1] flags=0 cpu=5 credit=-1254238270 [w=256] load=262144 (~100%) We certainly don't want, under any circumstance, this to happen. Therefore, let's use INT_MIN for the credits of the idle vCPU, in Credit2, to be sure that no vCPU can get below that value. NOTE: investigations have been done about _how_ it is possible for a vCPU to execute for so long that its credits becomes so low. While still not completely clear, there are evidence that: - it only happens very rarely - it appears to be both machine and workload specific - it does not look to be a Credit2 (e.g., as it happens when running with Credit1 as well) issue, or a scheduler issue This patch makes Credit2 more robust to events like this, whatever the cause is, and should hence be backported (as far as possible). Signed-off-by: Dario Faggioli Reported-by: Glen Reported-by: Tomas Mozes Acked-by: George Dunlap --- Cc: George Dunlap Cc: Juergen Gross Cc: Jan Beulich Cc: Charles Arnold Cc: Sarah Newman --- I will provide the backports myself, at least for 4.13 and 4.12.x (and feel free to ask for more). --- For Sarah, looking back at the various threads, I am not quite sure whether you also experienced the issue and reported it. If yes, I'm happy to add a "Reported-by:" line about you too (or, if this is fine to go in, for this to be done while committing, if possible). --- xen/common/sched/credit2.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/xen/common/sched/credit2.c b/xen/common/sched/credit2.c index c7241944a8..5c0ab9cd05 100644 --- a/xen/common/sched/credit2.c +++ b/xen/common/sched/credit2.c @@ -234,7 +234,7 @@ * units does not consume credits, and it must be lower than whatever * amount of credit 'regular' unit would end up with. */ -#define CSCHED2_IDLE_CREDIT (-(1U<<30)) +#define CSCHED2_IDLE_CREDIT INT_MIN /* * Carryover: How much "extra" credit may be carried over after * a reset. From patchwork Thu Mar 12 13:44:20 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dario Faggioli X-Patchwork-Id: 11434307 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id D65D91668 for ; Thu, 12 Mar 2020 13:45:32 +0000 (UTC) Received: from lists.xenproject.org (lists.xenproject.org [192.237.175.120]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id A8D2E2067C for ; Thu, 12 Mar 2020 13:45:32 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org A8D2E2067C Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=suse.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=xen-devel-bounces@lists.xenproject.org Received: from localhost ([127.0.0.1] helo=lists.xenproject.org) by lists.xenproject.org with esmtp (Exim 4.89) (envelope-from ) id 1jCO8P-0002UX-Fa; Thu, 12 Mar 2020 13:44:25 +0000 Received: from us1-rack-iad1.inumbo.com ([172.99.69.81]) by lists.xenproject.org with esmtp (Exim 4.89) (envelope-from ) id 1jCO8O-0002UJ-CA for xen-devel@lists.xenproject.org; Thu, 12 Mar 2020 13:44:24 +0000 X-Inumbo-ID: 94ed6498-6467-11ea-b34e-bc764e2007e4 Received: from mail-wm1-f65.google.com (unknown [209.85.128.65]) by us1-rack-iad1.inumbo.com (Halon) with ESMTPS id 94ed6498-6467-11ea-b34e-bc764e2007e4; Thu, 12 Mar 2020 13:44:23 +0000 (UTC) Received: by mail-wm1-f65.google.com with SMTP id n8so6149863wmc.4 for ; Thu, 12 Mar 2020 06:44:23 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:from:to:cc:date:message-id:in-reply-to :references:user-agent:mime-version:content-transfer-encoding; bh=CkT7bc8hamegT/PZ3fRPIRZB1s85fsBvvkAcU7U7HTU=; b=s2nhzikcUv/QQc3BHHaIkZa/uVOrkSj2DGYonlwe+m/4exKtL1tLPNLUezFPLUaynM sNRzdUWLSHQq4v6Udu4kG311Gi3mugg7y6ADn2WJPyT5RYsw5ZxlAJYY2Q93MRxAhzWu 9CkBN9Hj08a6wtL1LhZNEs+jb+ZZzLy0VYikVq3m76Y/HiA+NfLox5AZ4GJwZsnludD0 x4BznRjj1lAgt8tU+AxMCZcSWqpCN5/A5m2vNHCF13dmdfwkZO0F+lNJGagm1P9PColx 5LnRdMc7tDH+jQbr1gr6LpTZbdDb8CK21kOZiKLn6YMsRHLRWbWC6HnP0zjj7XkIBPWY FqmQ== X-Gm-Message-State: ANhLgQ2Ghgy5uWwXBeEwGALjntsN8c23T0kmJ5dTb/VBczk9N2uUjiF2 FYqK54OghwyhlOLjQirGVh0= X-Google-Smtp-Source: ADFU+vvhJSmZLXjC7AIqwFTK6xpllQptyVlgFX1avV8kOqk5oFPn2TznN+Y51yVmutA8/2tc/V1LBA== X-Received: by 2002:a1c:ab04:: with SMTP id u4mr4918728wme.88.1584020662657; Thu, 12 Mar 2020 06:44:22 -0700 (PDT) Received: from [192.168.0.36] (87.78.186.89.cust.ip.kpnqwest.it. [89.186.78.87]) by smtp.gmail.com with ESMTPSA id w16sm42049520wrp.8.2020.03.12.06.44.21 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 12 Mar 2020 06:44:21 -0700 (PDT) From: Dario Faggioli To: xen-devel@lists.xenproject.org Date: Thu, 12 Mar 2020 14:44:20 +0100 Message-ID: <158402066079.753.15216909367478917806.stgit@Palanthas> In-Reply-To: <158402056376.753.7091379488590272336.stgit@Palanthas> References: <158402056376.753.7091379488590272336.stgit@Palanthas> User-Agent: StGit/0.21 MIME-Version: 1.0 Subject: [Xen-devel] [PATCH 2/2] xen: credit2: fix credit reset happening too few times X-BeenThere: xen-devel@lists.xenproject.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Xen developer discussion List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Cc: Juergen Gross , Charles Arnold , Jan Beulich , Glen , George Dunlap , Tomas Mozes , Sarah Newman Errors-To: xen-devel-bounces@lists.xenproject.org Sender: "Xen-devel" There is a bug in commit 5e4b4199667b9 ("xen: credit2: only reset credit on reset condition"). In fact, the aim of that commit was to make sure that we do not perform too many credit reset operations (which are not super cheap, and in an hot-path). But the check used to determine whether a reset is necessary was the wrong one. In fact, knowing just that some vCPUs have been skipped, while traversing the runqueue (in runq_candidate()), is not enough. We need to check explicitly whether the first vCPU in the runqueue has a negative amount of credit. Since a trace record is changed, this patch updates xentrace format file and xenalyze as well This should be backported. Signed-off-by: Dario Faggioli --- Cc: George Dunlap Cc: Juergen Gross Cc: Jan Beulich Cc: Charles Arnold Cc: Glen Cc: Tomas Mozes Cc: Sarah Newman --- About the Credit2 stall issue reported recently, and mentioned in patch 1 of this series. This second patch, alone, was already mitigating the issue quite substantially. Still, the proper fix for the issue itself is patch 1, while this is a fix for a bug in the code, introduced with a previous change, which happens to help to cure the sympthoms of the problem at hand. --- tools/xentrace/formats | 2 +- tools/xentrace/xenalyze.c | 8 +++----- xen/common/sched/credit2.c | 30 +++++++++++++----------------- 3 files changed, 17 insertions(+), 23 deletions(-) diff --git a/tools/xentrace/formats b/tools/xentrace/formats index d6e7e3f800..8142d8889b 100644 --- a/tools/xentrace/formats +++ b/tools/xentrace/formats @@ -67,7 +67,7 @@ 0x00022210 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:load_check [ lrq_id[16]:orq_id[16] = 0x%(1)08x, delta = %(2)d ] 0x00022211 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:load_balance [ l_bavgload = 0x%(2)08x%(1)08x, o_bavgload = 0x%(4)08x%(3)08x, lrq_id[16]:orq_id[16] = 0x%(5)08x ] 0x00022212 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:pick_cpu [ b_avgload = 0x%(2)08x%(1)08x, dom:vcpu = 0x%(3)08x, rq_id[16]:new_cpu[16] = %(4)d ] -0x00022213 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:runq_candidate [ dom:vcpu = 0x%(1)08x, credit = %(4)d, skipped_vcpus = %(3)d, tickled_cpu = %(2)d ] +0x00022213 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:runq_candidate [ dom:vcpu = 0x%(1)08x, credit = %(3)d, tickled_cpu = %(2)d ] 0x00022214 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:schedule [ rq:cpu = 0x%(1)08x, tasklet[8]:idle[8]:smt_idle[8]:tickled[8] = %(2)08x ] 0x00022215 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:ratelimit [ dom:vcpu = 0x%(1)08x, runtime = %(2)d ] 0x00022216 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:runq_cand_chk [ dom:vcpu = 0x%(1)08x ] diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c index aa894673ad..e555c572f7 100644 --- a/tools/xentrace/xenalyze.c +++ b/tools/xentrace/xenalyze.c @@ -7856,14 +7856,12 @@ void sched_process(struct pcpu_info *p) if (opt.dump_all) { struct { unsigned vcpuid:16, domid:16; - unsigned tickled_cpu, skipped; + unsigned tickled_cpu; int credit; } *r = (typeof(r))ri->d; - printf(" %s csched2:runq_candidate d%uv%u, credit = %d, " - "%u vcpus skipped, ", - ri->dump_header, r->domid, r->vcpuid, - r->credit, r->skipped); + printf(" %s csched2:runq_candidate d%uv%u, credit = %d, ", + ri->dump_header, r->domid, r->vcpuid, r->credit); if (r->tickled_cpu == (unsigned)-1) printf("no cpu was tickled\n"); else diff --git a/xen/common/sched/credit2.c b/xen/common/sched/credit2.c index 5c0ab9cd05..20692cf8ef 100644 --- a/xen/common/sched/credit2.c +++ b/xen/common/sched/credit2.c @@ -3221,8 +3221,7 @@ csched2_runtime(const struct scheduler *ops, int cpu, static struct csched2_unit * runq_candidate(struct csched2_runqueue_data *rqd, struct csched2_unit *scurr, - int cpu, s_time_t now, - unsigned int *skipped) + int cpu, s_time_t now) { struct list_head *iter, *temp; const struct sched_resource *sr = get_sched_res(cpu); @@ -3230,8 +3229,6 @@ runq_candidate(struct csched2_runqueue_data *rqd, struct csched2_private *prv = csched2_priv(sr->scheduler); bool yield = false, soft_aff_preempt = false; - *skipped = 0; - if ( unlikely(is_idle_unit(scurr->unit)) ) { snext = scurr; @@ -3325,12 +3322,9 @@ runq_candidate(struct csched2_runqueue_data *rqd, (unsigned char *)&d); } - /* Only consider units that are allowed to run on this processor. */ + /* Only consider vcpus that are allowed to run on this processor. */ if ( !cpumask_test_cpu(cpu, svc->unit->cpu_hard_affinity) ) - { - (*skipped)++; continue; - } /* * If an unit is meant to be picked up by another processor, and such @@ -3339,7 +3333,6 @@ runq_candidate(struct csched2_runqueue_data *rqd, if ( svc->tickled_cpu != -1 && svc->tickled_cpu != cpu && cpumask_test_cpu(svc->tickled_cpu, &rqd->tickled) ) { - (*skipped)++; SCHED_STAT_CRANK(deferred_to_tickled_cpu); continue; } @@ -3351,7 +3344,6 @@ runq_candidate(struct csched2_runqueue_data *rqd, if ( sched_unit_master(svc->unit) != cpu && snext->credit + CSCHED2_MIGRATE_RESIST > svc->credit ) { - (*skipped)++; SCHED_STAT_CRANK(migrate_resisted); continue; } @@ -3375,14 +3367,13 @@ runq_candidate(struct csched2_runqueue_data *rqd, { struct { unsigned unit:16, dom:16; - unsigned tickled_cpu, skipped; + unsigned tickled_cpu; int credit; } d; d.dom = snext->unit->domain->domain_id; d.unit = snext->unit->unit_id; d.credit = snext->credit; d.tickled_cpu = snext->tickled_cpu; - d.skipped = *skipped; __trace_var(TRC_CSCHED2_RUNQ_CANDIDATE, 1, sizeof(d), (unsigned char *)&d); @@ -3414,7 +3405,6 @@ static void csched2_schedule( struct csched2_runqueue_data *rqd; struct csched2_unit * const scurr = csched2_unit(currunit); struct csched2_unit *snext = NULL; - unsigned int skipped_units = 0; bool tickled; bool migrated = false; @@ -3492,7 +3482,7 @@ static void csched2_schedule( snext = csched2_unit(sched_idle_unit(sched_cpu)); } else - snext = runq_candidate(rqd, scurr, sched_cpu, now, &skipped_units); + snext = runq_candidate(rqd, scurr, sched_cpu, now); /* If switching from a non-idle runnable unit, put it * back on the runqueue. */ @@ -3504,6 +3494,8 @@ static void csched2_schedule( /* Accounting for non-idle tasks */ if ( !is_idle_unit(snext->unit) ) { + int top_credit; + /* If switching, remove this from the runqueue and mark it scheduled */ if ( snext != scurr ) { @@ -3531,11 +3523,15 @@ static void csched2_schedule( * 2) no other unit with higher credits wants to run. * * Here, where we want to check for reset, we need to make sure the - * proper unit is being used. In fact, runqueue_candidate() may have - * not returned the first unit in the runqueue, for various reasons + * proper unit is being used. In fact, runq_candidate() may have not + * returned the first unit in the runqueue, for various reasons * (e.g., affinity). Only trigger a reset when it does. */ - if ( skipped_units == 0 && snext->credit <= CSCHED2_CREDIT_RESET ) + if ( list_empty(&rqd->runq) ) + top_credit = snext->credit; + else + top_credit = max(snext->credit, runq_elem(rqd->runq.next)->credit); + if ( top_credit <= CSCHED2_CREDIT_RESET ) { reset_credit(sched_cpu, now, snext); balance_load(ops, sched_cpu, now);