From patchwork Wed Jul 26 14:39:50 2017
X-Patchwork-Submitter: Olaf Hering
X-Patchwork-Id: 9865187
From: Olaf Hering <olaf@aepfle.de>
To: xen-devel@lists.xen.org, Ian Jackson, Wei Liu
Cc: Olaf Hering
Date: Wed, 26 Jul 2017 16:39:50 +0200
Message-Id: <20170726143950.30329-4-olaf@aepfle.de>
In-Reply-To: <20170726143950.30329-1-olaf@aepfle.de>
References: <20170726143950.30329-1-olaf@aepfle.de>
X-Mailer: git-send-email 2.13.2
Subject: [Xen-devel] [PATCH v3 3/3] docs: add pod variant of xl-numa-placement

Convert the source for xl-numa-placement.7 from markdown to pod. This
removes the build-time requirement for pandoc, and with it the need for
ghc, from the BuildRequires chain of xen.rpm.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Dario Faggioli
---
 ...lacement.markdown.7 => xl-numa-placement.pod.7} | 166 ++++++++++++++-------
 1 file changed, 110 insertions(+), 56 deletions(-)
 rename docs/man/{xl-numa-placement.markdown.7 => xl-numa-placement.pod.7} (74%)

diff --git a/docs/man/xl-numa-placement.markdown.7 b/docs/man/xl-numa-placement.pod.7
similarity index 74%
rename from docs/man/xl-numa-placement.markdown.7
rename to docs/man/xl-numa-placement.pod.7
index f863492093..54a444172e 100644
--- a/docs/man/xl-numa-placement.markdown.7
+++ b/docs/man/xl-numa-placement.pod.7
@@ -1,6 +1,12 @@
-# Guest Automatic NUMA Placement in libxl and xl #
+=encoding utf8
 
-## Rationale ##
+=head1 NAME
+
+Guest Automatic NUMA Placement in libxl and xl
+
+=head1 DESCRIPTION
+
+=head2 Rationale
 
 NUMA (which stands for Non-Uniform Memory Access) means that the memory
 accessing times of a program running on a CPU depends on the relative
@@ -17,13 +23,14 @@ running memory-intensive workloads on a shared host. In fact, the cost
 of accessing non node-local memory locations is very high, and the
 performance degradation is likely to be noticeable.
 
-For more information, have a look at the [Xen NUMA Introduction][numa_intro]
+For more information, have a look at the L<Xen NUMA Introduction|http://wiki.xen.org/wiki/Xen_NUMA_Introduction>
 page on the Wiki.
 
-## Xen and NUMA machines: the concept of _node-affinity_ ##
+
+=head2 Xen and NUMA machines: the concept of I<node-affinity>
 
 The Xen hypervisor deals with NUMA machines throughout the concept of
-_node-affinity_. The node-affinity of a domain is the set of NUMA nodes
+I<node-affinity>. The node-affinity of a domain is the set of NUMA nodes
 of the host where the memory for the domain is being allocated (mostly,
 at domain creation time). This is, at least in principle, different and
 unrelated with the vCPU (hard and soft, see below) scheduling affinity,
@@ -42,15 +49,16 @@ it is very important to "place" the domain correctly when it is fist
 created, as the most of its memory is allocated at that time and can
 not (for now) be moved easily.
 
-### Placing via pinning and cpupools ###
+
+=head2 Placing via pinning and cpupools
 
 The simplest way of placing a domain on a NUMA node is setting the hard
 scheduling affinity of the domain's vCPUs to the pCPUs of the node. This
 also goes under the name of vCPU pinning, and can be done through the
 "cpus=" option in the config file (more about this below). Another option
 is to pool together the pCPUs spanning the node and put the domain in
-such a _cpupool_ with the "pool=" config option (as documented in our
-[Wiki][cpupools_howto]).
+such a I<cpupool> with the "pool=" config option (as documented in our
+L<Wiki|http://wiki.xen.org/wiki/Cpupools_Howto>).
 
 In both the above cases, the domain will not be able to execute outside
 the specified set of pCPUs for any reasons, even if all those pCPUs are
@@ -59,7 +67,8 @@ busy doing something else while there are others, idle, pCPUs. So, when
 doing this, local memory accesses are 100% guaranteed, but that may come
 at he cost of some load imbalances.
 
-### NUMA aware scheduling ###
+
+=head2 NUMA aware scheduling
 
 If using the credit1 scheduler, and starting from Xen 4.3, the scheduler
 itself always tries to run the domain's vCPUs on one of the nodes in
@@ -87,21 +96,37 @@ workload.
 
 Notice that, for each vCPU, the following three scenarios are possbile:
 
- * a vCPU *is pinned* to some pCPUs and *does not have* any soft affinity
-   In this case, the vCPU is always scheduled on one of the pCPUs to which
-   it is pinned, without any specific peference among them.
- * a vCPU *has* its own soft affinity and *is not* pinned to any particular
-   pCPU. In this case, the vCPU can run on every pCPU. Nevertheless, the
-   scheduler will try to have it running on one of the pCPUs in its soft
-   affinity;
- * a vCPU *has* its own vCPU soft affinity and *is also* pinned to some
-   pCPUs. In this case, the vCPU is always scheduled on one of the pCPUs
-   onto which it is pinned, with, among them, a preference for the ones
-   that also forms its soft affinity. In case pinning and soft affinity
-   form two disjoint sets of pCPUs, pinning "wins", and the soft affinity
-   is just ignored.
-
-## Guest placement in xl ##
+=over
+
+=item *
+
+a vCPU I<is pinned> to some pCPUs and I<does not have> any soft affinity.
+In this case, the vCPU is always scheduled on one of the pCPUs to which
+it is pinned, without any specific preference among them.
+
+
+=item *
+
+a vCPU I<has> its own soft affinity and I<is not> pinned to any particular
+pCPU. In this case, the vCPU can run on every pCPU. Nevertheless, the
+scheduler will try to have it running on one of the pCPUs in its soft
+affinity;
+
+
+=item *
+
+a vCPU I<has> its own vCPU soft affinity and I<is also> pinned to some
+pCPUs. In this case, the vCPU is always scheduled on one of the pCPUs
+onto which it is pinned, with, among them, a preference for the ones
+that also form its soft affinity. In case pinning and soft affinity
+form two disjoint sets of pCPUs, pinning "wins", and the soft affinity
+is just ignored.
+
+
+=back
+
+
+=head2 Guest placement in xl
 
 If using xl for creating and managing guests, it is very easy to ask for
 both manual or automatic placement of them across the host's NUMA nodes.
@@ -111,7 +136,8 @@ the details of the heuristics adopted for automatic placement (see below),
 and the lack of support (in both xm/xend and the Xen versions where that
 was the default toolstack) for NUMA aware scheduling.
 
-### Placing the guest manually ###
+
+=head2 Placing the guest manually
 
 Thanks to the "cpus=" option, it is possible to specify where a domain
 should be created and scheduled on, directly in its config file. This
@@ -126,19 +152,31 @@ or Xen won't be able to guarantee the locality for their memory
 accesses. That, of course, also mean the vCPUs of the domain will only
 be able to execute on those same pCPUs.
 
-It is is also possible to have a "cpus\_soft=" option in the xl config file,
+It is also possible to have a "cpus_soft=" option in the xl config file,
 to specify the soft affinity for all the vCPUs of the domain. This affects
 the NUMA placement in the following way:
 
- * if only "cpus\_soft=" is present, the VM's node-affinity will be equal
-   to the nodes to which the pCPUs in the soft affinity mask belong;
- * if both "cpus\_soft=" and "cpus=" are present, the VM's node-affinity
-   will be equal to the nodes to which the pCPUs present both in hard and
-   soft affinity belong.
+=over
+
+=item *
+
+if only "cpus_soft=" is present, the VM's node-affinity will be equal
+to the nodes to which the pCPUs in the soft affinity mask belong;
 
-### Placing the guest automatically ###
 
-If neither "cpus=" nor "cpus\_soft=" are present in the config file, libxl
+=item *
+
+if both "cpus_soft=" and "cpus=" are present, the VM's node-affinity
+will be equal to the nodes to which the pCPUs present both in hard and
+soft affinity belong.
+
+
+=back
+
+
+=head2 Placing the guest automatically
+
+If neither "cpus=" nor "cpus_soft=" are present in the config file, libxl
 tries to figure out on its own on which node(s) the domain could fit best.
 If it finds one (some), the domain's node affinity get set to there,
 and both memory allocations and NUMA aware scheduling (for the credit
@@ -160,14 +198,29 @@ to have, and as much pCPUs as it has vCPUs. After that, the actual
 decision on which candidate to pick happens accordingly to the following
 heuristics:
 
- * candidates involving fewer nodes are considered better. In case
-   two (or more) candidates span the same number of nodes,
- * candidates with a smaller number of vCPUs runnable on them (due
-   to previous placement and/or plain vCPU pinning) are considered
-   better. In case the same number of vCPUs can run on two (or more)
-   candidates,
- * the candidate with with the greatest amount of free memory is
-   considered to be the best one.
+=over
+
+=item *
+
+candidates involving fewer nodes are considered better. In case
+two (or more) candidates span the same number of nodes,
+
+
+=item *
+
+candidates with a smaller number of vCPUs runnable on them (due
+to previous placement and/or plain vCPU pinning) are considered
+better. In case the same number of vCPUs can run on two (or more)
+candidates,
+
+
+=item *
+
+the candidate with the greatest amount of free memory is
+considered to be the best one.
+
+
+=back
 
 Giving preference to candidates with fewer nodes ensures better
 performance for the guest, as it avoid spreading its memory among
@@ -178,35 +231,37 @@ largest amounts of free memory helps keeping the memory fragmentation
 small, and maximizes the probability of being able to put more domains
 there.
 
-## Guest placement in libxl ##
+
+=head2 Guest placement in libxl
 
 xl achieves automatic NUMA placement because that is what libxl does
 by default. No API is provided (yet) for modifying the behaviour of
 the placement algorithm. However, if your program is calling libxl,
-it is possible to set the `numa_placement` build info key to `false`
-(it is `true` by default) with something like the below, to prevent
+it is possible to set the C<numa_placement> build info key to C<false>
+(it is C<true> by default) with something like the below, to prevent
 any placement from happening:
 
     libxl_defbool_set(&domain_build_info->numa_placement, false);
 
-Also, if `numa_placement` is set to `true`, the domain's vCPUs must
-not be pinned (i.e., `domain_build_info->cpumap` must have all its
+Also, if C<numa_placement> is set to C<true>, the domain's vCPUs must
+not be pinned (i.e., C<<< domain_build_info->cpumap >>> must have all its
 bits set, as it is by default), or domain creation will fail with
-`ERROR_INVAL`.
+C<ERROR_INVAL>.
 
 Starting from Xen 4.3, in case automatic placement happens (and is
-successful), it will affect the domain's node-affinity and _not_ its
+successful), it will affect the domain's node-affinity and I<not> its
 vCPU pinning. Namely, the domain's vCPUs will not be pinned to any
 pCPU on the host, but the memory from the domain will come from the
 selected node(s) and the NUMA aware scheduling (if the credit scheduler
 is in use) will try to keep the domain's vCPUs there as much as possible.
 
 Besides than that, looking and/or tweaking the placement algorithm
-search "Automatic NUMA placement" in libxl\_internal.h.
+search "Automatic NUMA placement" in libxl_internal.h.
 
 Note this may change in future versions of Xen/libxl.
 
-## Xen < 4.5 ##
+
+=head2 Xen < 4.5
 
 The concept of vCPU soft affinity has been introduced for the first
 time in Xen 4.5. In 4.3, it is the domain's node-affinity that drives
@@ -215,25 +270,24 @@ and so each vCPU can have its own mask of pCPUs, while node-affinity is
 per-domain, that is the equivalent of having all the vCPUs with the same
 soft affinity.
 
-## Xen < 4.3 ##
+
+=head2 Xen < 4.3
 
 As NUMA aware scheduling is a new feature of Xen 4.3, things are a little
 bit different for earlier version of Xen. If no "cpus=" option is specified
 and Xen 4.2 is in use, the automatic placement algorithm still runs, but
-the results is used to _pin_ the vCPUs of the domain to the output node(s).
+the result is used to I<pin> the vCPUs of the domain to the output node(s).
 This is consistent with what was happening with xm/xend.
 
 On a version of Xen earlier than 4.2, there is not automatic placement at
 all in xl or libxl, and hence no node-affinity, vCPU affinity or pinning
 being introduced/modified.
 
-## Limitations ##
+
+=head2 Limitations
 
 Analyzing various possible placement solutions is what makes the
 algorithm flexible and quite effective. However, that also means it won't
 scale well to systems with arbitrary number of nodes. For this reason,
 automatic placement is disabled (with a warning) if it is requested on a
 host with more than 16 NUMA nodes.
-
-[numa_intro]: http://wiki.xen.org/wiki/Xen_NUMA_Introduction
-[cpupools_howto]: http://wiki.xen.org/wiki/Cpupools_Howto
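
As an aside for readers who want to try out the behaviour described in the
converted document: below is a minimal xl domain configuration sketch for the
"Placing via pinning and cpupools" / "Placing the guest manually" cases. The
guest name, memory size, the assumption that pCPUs 0-3 make up NUMA node 0,
and the cpupool name "Pool-node0" are illustrative only, not part of the
patch, and the fragment is not complete enough to actually boot a guest.

    # Hypothetical guest config fragment: keep the domain on NUMA node 0.
    name   = "guest-pinned"
    memory = 2048
    vcpus  = 4

    # Hard affinity: the vCPUs may only run on these pCPUs (assumed here
    # to be the pCPUs of node 0), so the domain's memory is allocated on,
    # and stays local to, node 0.
    cpus   = "0-3"

    # Alternative to pinning: put the domain in a cpupool spanning node 0,
    # created beforehand (e.g. with "xl cpupool-numa-split"); either
    # mechanism places the domain on node 0.
    #pool  = "Pool-node0"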
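Similarly, for the "cpus_soft=" behaviour under "Placing the guest manually":
a sketch (again with made-up pCPU numbers) of soft affinity only, which leaves
the scheduler free to use any pCPU while the domain's node-affinity is derived
from the soft-affinity mask, plus a commented variant combining hard and soft
affinity as discussed in the document.

    # Hypothetical guest config fragment: prefer, but do not require,
    # the pCPUs of node 1 (assumed to be 4-7 on this host).
    name      = "guest-soft"
    memory    = 2048
    vcpus     = 4
    cpus_soft = "4-7"

    # With both options, node-affinity comes from the pCPUs present in
    # both masks; if the two sets are disjoint, pinning wins and the
    # soft affinity is ignored.
    #cpus      = "4-7"
    #cpus_soft = "4-5"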
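Finally, to put the libxl_defbool_set() one-liner from the "Guest placement
in libxl" section into a compilable context, here is a minimal C sketch. The
helper name disable_numa_placement() is mine and not part of the libxl API;
only libxl_defbool_set() and the b_info.numa_placement / b_info.cpumap fields
come from libxl itself.

    #include <stdbool.h>
    #include <libxl.h>

    /* Sketch: tweak an already initialised libxl_domain_config so that
     * libxl performs no automatic NUMA placement for the new domain. */
    static void disable_numa_placement(libxl_domain_config *d_config)
    {
        /* numa_placement defaults to true; setting it to false prevents
         * libxl from picking a node-affinity on its own. */
        libxl_defbool_set(&d_config->b_info.numa_placement, false);

        /* Conversely, if numa_placement stays true, b_info.cpumap must
         * keep all bits set (its default), or domain creation fails
         * with ERROR_INVAL, as the document above explains. */
    }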