From patchwork Thu Jul 5 16:49:40 2012
X-Patchwork-Submitter: Chegu Vinod
X-Patchwork-Id: 1161751
From: Chegu Vinod
To: qemu-devel@nongnu.org
Cc: kvm@vger.kernel.org, Chegu Vinod, Jim Hull, Craig Hada
Subject: [PATCH v3] Fixes related to processing of qemu's -numa option
Date: Thu, 5 Jul 2012 09:49:40 -0700
Message-Id: <1341506980-31145-1-git-send-email-chegu_vinod@hp.com>
X-Mailer: git-send-email 1.7.8
X-Mailing-List: kvm@vger.kernel.org

Changes since v2:

- Using "unsigned long *" for the node_cpumask[].
- Use bitmap_new() instead of g_malloc0() for allocation.
- Don't rely on "max_cpus" since it may not be initialized
  before the numa related qemu options are parsed & processed.

Note: Continuing to use a new constant for allocation of
      the mask. (This constant is currently set to 255 since,
      with an 8-bit APIC ID, VCPUs can range from 0-254 in a
      guest. The APIC ID 255 (0xFF) is reserved for broadcast.)

Changes since v1:
----------------

- Use bitmap functions that are already in qemu (instead
  of cpu_set_t macros from sched.h).
- Added a check for endvalue >= max_cpus.
- Fix to address the round-robin assignment when
  CPUs are not explicitly specified.

Tested-by: Eduardo Habkost

-----------------------------------------------

v1:

Tested-by: Eduardo Habkost
Reviewed-by: Eduardo Habkost
---

The -numa option to qemu is used to create [fake] numa nodes
and expose them to the guest OS instance.

There are a couple of issues with the -numa option:

a) The maximum number of VCPUs that can be specified for a guest
   while using qemu's -numa option is 64. Due to a typecasting
   issue, when the number of VCPUs is > 32 the VCPUs don't show up
   under the specified [fake] numa nodes.

b) KVM currently supports 160 VCPUs per guest, but qemu's -numa
   option only supports up to 64 VCPUs per guest.

This patch addresses these two issues.

Below are examples of (a) and (b):

a) >32 VCPUs are specified with the -numa option:

/usr/local/bin/qemu-system-x86_64 \
-enable-kvm \
71:01:01 \
-net tap,ifname=tap0,script=no,downscript=no \
-vnc :4

...
Upstream qemu :
--------------

QEMU 1.1.50 monitor - type 'help' for more information
(qemu) info numa
6 nodes
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 32 33 34 35 36 37 38 39 40 41
node 0 size: 131072 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 42 43 44 45 46 47 48 49 50 51
node 1 size: 131072 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29 52 53 54 55 56 57 58 59
node 2 size: 131072 MB
node 3 cpus: 30
node 3 size: 131072 MB
node 4 cpus:
node 4 size: 131072 MB
node 5 cpus: 31
node 5 size: 131072 MB

With the patch applied :
-----------------------

QEMU 1.1.50 monitor - type 'help' for more information
(qemu) info numa
6 nodes
node 0 cpus: 0 1 2 3 4 5 6 7 8 9
node 0 size: 131072 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19
node 1 size: 131072 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29
node 2 size: 131072 MB
node 3 cpus: 30 31 32 33 34 35 36 37 38 39
node 3 size: 131072 MB
node 4 cpus: 40 41 42 43 44 45 46 47 48 49
node 4 size: 131072 MB
node 5 cpus: 50 51 52 53 54 55 56 57 58 59
node 5 size: 131072 MB

b) >64 VCPUs specified with -numa option:

/usr/local/bin/qemu-system-x86_64 \
-enable-kvm \
-cpu Westmere,+rdtscp,+pdpe1gb,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pclmuldq,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme \
-smp sockets=8,cores=10,threads=1 \
-numa node,nodeid=0,cpus=0-9,mem=64g \
-numa node,nodeid=1,cpus=10-19,mem=64g \
-numa node,nodeid=2,cpus=20-29,mem=64g \
-numa node,nodeid=3,cpus=30-39,mem=64g \
-numa node,nodeid=4,cpus=40-49,mem=64g \
-numa node,nodeid=5,cpus=50-59,mem=64g \
-numa node,nodeid=6,cpus=60-69,mem=64g \
-numa node,nodeid=7,cpus=70-79,mem=64g \
-m 524288 \
-name vm1 \
-chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait \
-drive file=/dev/libvirt_lvm/vm.img,if=none,id=drive-virtio-disk0,format=raw,cache=none,aio=native \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-monitor stdio \
-net nic,macaddr=52:54:00:71:01:01 \
-net
tap,ifname=tap0,script=no,downscript=no \
-vnc :4

...

Upstream qemu :
--------------

only 63 CPUs in NUMA mode supported.
only 64 CPUs in NUMA mode supported.
QEMU 1.1.50 monitor - type 'help' for more information
(qemu) info numa
8 nodes
node 0 cpus: 6 7 8 9 38 39 40 41 70 71 72 73
node 0 size: 65536 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 42 43 44 45 46 47 48 49 50 51 74 75 76 77 78 79
node 1 size: 65536 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29 52 53 54 55 56 57 58 59 60 61
node 2 size: 65536 MB
node 3 cpus: 30 62
node 3 size: 65536 MB
node 4 cpus:
node 4 size: 65536 MB
node 5 cpus:
node 5 size: 65536 MB
node 6 cpus: 31 63
node 6 size: 65536 MB
node 7 cpus: 0 1 2 3 4 5 32 33 34 35 36 37 64 65 66 67 68 69
node 7 size: 65536 MB

With the patch applied :
-----------------------

QEMU 1.1.50 monitor - type 'help' for more information
(qemu) info numa
8 nodes
node 0 cpus: 0 1 2 3 4 5 6 7 8 9
node 0 size: 65536 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19
node 1 size: 65536 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29
node 2 size: 65536 MB
node 3 cpus: 30 31 32 33 34 35 36 37 38 39
node 3 size: 65536 MB
node 4 cpus: 40 41 42 43 44 45 46 47 48 49
node 4 size: 65536 MB
node 5 cpus: 50 51 52 53 54 55 56 57 58 59
node 5 size: 65536 MB
node 6 cpus: 60 61 62 63 64 65 66 67 68 69
node 6 size: 65536 MB
node 7 cpus: 70 71 72 73 74 75 76 77 78 79

Signed-off-by: Chegu Vinod, Jim Hull, Craig Hada
Tested-by: Eduardo Habkost
---
 cpus.c   |    3 ++-
 hw/pc.c  |    3 ++-
 sysemu.h |    3 ++-
 vl.c     |   48 ++++++++++++++++++++++++++----------------------
 4 files changed, 32 insertions(+), 25 deletions(-)

diff --git a/cpus.c b/cpus.c
index b182b3d..acccd08 100644
--- a/cpus.c
+++ b/cpus.c
@@ -36,6 +36,7 @@
 #include "cpus.h"
 #include "qtest.h"
 #include "main-loop.h"
+#include "bitmap.h"

 #ifndef _WIN32
 #include "compatfd.h"
@@ -1145,7 +1146,7 @@ void set_numa_modes(void)

     for (env = first_cpu; env != NULL; env = env->next_cpu) {
         for (i = 0; i < nb_numa_nodes; i++) {
-            if
(node_cpumask[i] & (1 << env->cpu_index)) {
+            if (test_bit(env->cpu_index, node_cpumask[i])) {
                 env->numa_node = i;
             }
         }
diff --git a/hw/pc.c b/hw/pc.c
index c7e9ab3..2edcc07 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -48,6 +48,7 @@
 #include "memory.h"
 #include "exec-memory.h"
 #include "arch_init.h"
+#include "bitmap.h"

 /* output Bochs bios info messages */
 //#define DEBUG_BIOS
@@ -639,7 +640,7 @@ static void *bochs_bios_init(void)
     numa_fw_cfg[0] = cpu_to_le64(nb_numa_nodes);
     for (i = 0; i < max_cpus; i++) {
         for (j = 0; j < nb_numa_nodes; j++) {
-            if (node_cpumask[j] & (1 << i)) {
+            if (test_bit(i, node_cpumask[j])) {
                 numa_fw_cfg[i + 1] = cpu_to_le64(j);
                 break;
             }
diff --git a/sysemu.h b/sysemu.h
index bc2c788..2ce63fc 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -133,9 +133,10 @@ extern uint8_t qemu_extra_params_fw[2];
 extern QEMUClock *rtc_clock;

 #define MAX_NODES 64
+#define MAX_CPUMASK_BITS 255
 extern int nb_numa_nodes;
 extern uint64_t node_mem[MAX_NODES];
-extern uint64_t node_cpumask[MAX_NODES];
+extern unsigned long *node_cpumask[MAX_NODES];

 #define MAX_OPTION_ROMS 16
 typedef struct QEMUOptionRom {
diff --git a/vl.c b/vl.c
index 1329c30..fdd7b74 100644
--- a/vl.c
+++ b/vl.c
@@ -28,6 +28,7 @@
 #include
 #include
 #include
+#include "bitmap.h"

 /* Needed early for CONFIG_BSD etc.
*/
 #include "config-host.h"
@@ -240,7 +241,7 @@ QTAILQ_HEAD(, FWBootEntry) fw_boot_order = QTAILQ_HEAD_INITIALIZER(fw_boot_order

 int nb_numa_nodes;
 uint64_t node_mem[MAX_NODES];
-uint64_t node_cpumask[MAX_NODES];
+unsigned long *node_cpumask[MAX_NODES];

 uint8_t qemu_uuid[16];

@@ -950,6 +951,9 @@ static void numa_add(const char *optarg)
     char *endptr;
     unsigned long long value, endvalue;
     int nodenr;
+    int i;
+
+    value = endvalue = 0ULL;

     optarg = get_opt_name(option, 128, optarg, ',') + 1;
     if (!strcmp(option, "node")) {
@@ -970,27 +974,25 @@ static void numa_add(const char *optarg)
             }
             node_mem[nodenr] = sval;
         }
-        if (get_param_value(option, 128, "cpus", optarg) == 0) {
-            node_cpumask[nodenr] = 0;
-        } else {
+        if (get_param_value(option, 128, "cpus", optarg) != 0) {
             value = strtoull(option, &endptr, 10);
-            if (value >= 64) {
-                value = 63;
-                fprintf(stderr, "only 64 CPUs in NUMA mode supported.\n");
+            if (*endptr == '-') {
+                endvalue = strtoull(endptr+1, &endptr, 10);
             } else {
-                if (*endptr == '-') {
-                    endvalue = strtoull(endptr+1, &endptr, 10);
-                    if (endvalue >= 63) {
-                        endvalue = 62;
-                        fprintf(stderr,
-                            "only 63 CPUs in NUMA mode supported.\n");
-                    }
-                    value = (2ULL << endvalue) - (1ULL << value);
-                } else {
-                    value = 1ULL << value;
-                }
+                endvalue = value;
+            }
+
+
+            if (!(endvalue < MAX_CPUMASK_BITS)) {
+                endvalue = MAX_CPUMASK_BITS - 1;
+                fprintf(stderr,
+                    "A max of %d CPUs are supported in a guest\n",
+                     MAX_CPUMASK_BITS);
+            }
+
+            for (i = value; i <= endvalue; ++i) {
+                set_bit(i, node_cpumask[nodenr]);
             }
-            node_cpumask[nodenr] = value;
         }
         nb_numa_nodes++;
     }
@@ -2331,7 +2333,8 @@ int main(int argc, char **argv, char **envp)

     for (i = 0; i < MAX_NODES; i++) {
         node_mem[i] = 0;
-        node_cpumask[i] = 0;
+        node_cpumask[i] = bitmap_new(MAX_CPUMASK_BITS);
+        bitmap_zero(node_cpumask[i], MAX_CPUMASK_BITS);
     }

     nb_numa_nodes = 0;
@@ -3469,8 +3472,9 @@ int main(int argc, char **argv, char **envp)
         }

         for (i = 0; i < nb_numa_nodes; i++) {
-            if (node_cpumask[i] != 0)
+            if
(!bitmap_empty(node_cpumask[i], MAX_CPUMASK_BITS)) {
                 break;
+            }
         }
         /* assigning the VCPUs round-robin is easier to implement, guest OSes
          * must cope with this anyway, because there are BIOSes out there in
@@ -3478,7 +3482,7 @@ int main(int argc, char **argv, char **envp)
          */
         if (i == nb_numa_nodes) {
             for (i = 0; i < max_cpus; i++) {
-                node_cpumask[i % nb_numa_nodes] |= 1 << i;
+                set_bit(i, node_cpumask[i % nb_numa_nodes]);
             }
         }
     }
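
Reviewer note: the misplacement shown in example (a) can be reproduced outside qemu. The following is only a minimal sketch under stated assumptions, not code from the patch. All demo_* names are hypothetical stand-ins (the patch itself uses bitmap_new(), set_bit() and test_bit() from qemu's bitmap.h). The old test "mask & (1 << cpu)" shifts a plain int, which ISO C leaves undefined for cpu >= 31; the sketch assumes the behaviour commonly seen on x86 hosts (shift count taken modulo 32, then the 32-bit result sign-extended against the 64-bit mask) and emulates it explicitly so the demo itself stays well defined.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_CPUMASK_BITS 255 /* same value the patch adds to sysemu.h */
#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_LONGS(n) (((n) + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

/* Hypothetical stand-in for bitmap_new(): a zeroed array of longs. */
static unsigned long *demo_bitmap_new(size_t nbits)
{
    unsigned long *map = calloc(DEMO_LONGS(nbits), sizeof(unsigned long));
    assert(map != NULL);
    return map;
}

/* Hypothetical stand-in for set_bit(). */
static void demo_set_bit(size_t nr, unsigned long *map)
{
    assert(nr < MAX_CPUMASK_BITS);
    map[nr / DEMO_BITS_PER_LONG] |= 1UL << (nr % DEMO_BITS_PER_LONG);
}

/* Hypothetical stand-in for test_bit(). */
static bool demo_test_bit(size_t nr, const unsigned long *map)
{
    assert(nr < MAX_CPUMASK_BITS);
    return (map[nr / DEMO_BITS_PER_LONG] >> (nr % DEMO_BITS_PER_LONG)) & 1UL;
}

/* Old mask construction for a "cpus=first-last" range (last <= 62). */
static uint64_t demo_old_range_mask(unsigned first, unsigned last)
{
    return (UINT64_C(2) << last) - (UINT64_C(1) << first);
}

/* Emulates the old "mask & (1 << cpu)" test under the x86 assumption:
 * the shift count wraps modulo 32 and the int result is sign-extended. */
static bool demo_old_test(uint64_t mask, unsigned cpu)
{
    int32_t bit = (int32_t)(UINT32_C(1) << (cpu & 31));
    return (mask & (uint64_t)(int64_t)bit) != 0;
}

/* Same loop shape as set_numa_modes(): the last matching node wins.
 * Returns -1 when no node claims the CPU. */
static int demo_old_node_of(const uint64_t *masks, int nodes, unsigned cpu)
{
    int node = -1;
    for (int i = 0; i < nodes; i++) {
        if (demo_old_test(masks[i], cpu)) {
            node = i;
        }
    }
    return node;
}

/* Bitmap-based lookup, as the patched code does with test_bit(). */
static int demo_new_node_of(unsigned long *const *maps, int nodes, unsigned cpu)
{
    int node = -1;
    for (int i = 0; i < nodes; i++) {
        if (demo_test_bit(cpu, maps[i])) {
            node = i;
        }
    }
    return node;
}
```

With six nodes of ten CPUs each (cpus=0-9, 10-19, ..., 50-59), the emulated old lookup places CPU 32 on node 0 and CPU 31 on node 5, which is the same misplacement as in the "Upstream qemu" listing for (a); the bitmap lookup places both on node 3. Because the bitmap is an array of longs, it also has no 64-CPU ceiling, which is what issue (b) needs.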