Message ID | 20220920091218.1208658-3-wei.chen@arm.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Device tree based NUMA support for Arm - Part#2 | expand |
On 20.09.2022 11:12, Wei Chen wrote: > There are some codes in x86/numa.c can be shared by common > architectures to implememnt NUMA support. Just like some > variables and functions to check and store NUMA memory map. > And some variables and functions to do NUMA initialization. > > In this patch, we move them to common/numa.c and xen/numa.h > and use the CONFIG_NUMA to gate them for non-NUMA supported > architectures. As the target header file is Xen-style, so > we trim some spaces and replace tabs for the codes that has > been moved to xen/numa.h at the same time. > > As acpi_scan_nodes has been used in a common function, it > doesn't make sense to use acpi_xxx in common code, so we > rename it to numa_scan_nodes in this patch too. After that numa_process_nodes() now? > if we still use CONFIG_ACPI_NUMA in to gate numa_scan_nodes > in numa_initmem_init, that doesn't make sense. As CONFIG_NUMA > will be selected by CONFIG_ACPI_NUMA for x86. So, we replace > CONFIG_ACPI_NUMA by CONFIG_NUMA to gate numa_scan_nodes. > > As arch_numa_disabled has been implememnted for ACPI NUMA, > we can rename srat_disabled to numa_disabled and move it > to common code as well. Please update the description: arch_numa_disabled() appears in patch 5 only. Of course if you follow the comments to patch 2, the wording here would be correct again. > +static unsigned int __init extract_lsb_from_nodes(const struct node *nodes, > + nodeid_t numnodes) > +{ > + unsigned int i, nodes_used = 0; > + unsigned long spdx, epdx; > + unsigned long bitfield = 0, memtop = 0; > + > + for ( i = 0; i < numnodes; i++ ) > + { > + spdx = paddr_to_pdx(nodes[i].start); > + epdx = paddr_to_pdx(nodes[i].end - 1) + 1; > + > + if ( spdx >= epdx ) > + continue; > + > + bitfield |= spdx; Perhaps to be taken care of in a separate patch: We accumulate only the bits from the start addresses here, contrary to what the comment ahead of the function says (and I think it is the comment which is putting things correctly). > + nodes_used++; > + if ( epdx > memtop ) > + memtop = epdx; > + } > + > + if ( nodes_used <= 1 ) > + i = BITS_PER_LONG - 1; Is this actually going to be correct for all architectures? Aiui Arm64 has only up to 48 physical address bits, but what about an architecture allowing the use of all 64 bits? I think at the very least we want BUILD_BUG_ON(PADDR_BITS >= BITS_PER_LONG) here. > + else > + i = find_first_bit(&bitfield, sizeof(unsigned long) * 8); > + > + memnodemapsize = (memtop >> i) + 1; Again perhaps the subject of a separate patch: Isn't there an off-by-1 mistake here? memtop is the maximum of all epdx-es, which are calculated to be the first PDX following the region. Hence I'd expect memnodemapsize = ((memtop - 1) >> i) + 1; here. I guess I'll make patches for both issues, which you may then need to re-base over. > +static int __init cf_check numa_setup(const char *opt) > +{ > + if ( !strncmp(opt, "off", 3) ) > + numa_off = true; > + else if ( !strncmp(opt, "on", 2) ) > + numa_off = false; > +#ifdef CONFIG_NUMA_EMU > + else if ( !strncmp(opt, "fake=", 5) ) > + { > + numa_off = false; > + numa_fake = simple_strtoul(opt + 5, NULL, 0); > + if ( numa_fake >= MAX_NUMNODES ) > + numa_fake = MAX_NUMNODES; > + } > +#endif > + else > + return arch_numa_setup(opt); > + > + return 0; > +} > +custom_param("numa", numa_setup); Note that with this moved here at some point during your work (when allowing NUMA=y for Arm) you'll need to update the command line doc. > +static void cf_check dump_numa(unsigned char key) > +{ > + s_time_t now = NOW(); > + unsigned int i, j, n; > + struct domain *d; > + const struct page_info *page; > + unsigned int page_num_node[MAX_NUMNODES]; > + const struct vnuma_info *vnuma; > + > + printk("'%c' pressed -> dumping numa info (now = %"PRI_stime")\n", key, > + now); > + > + for_each_online_node ( i ) > + { > + paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1); > + > + printk("NODE%u start->%lu size->%lu free->%lu\n", > + i, node_start_pfn(i), node_spanned_pages(i), > + avail_node_heap_pages(i)); > + /* Sanity check phys_to_nid() */ > + if ( phys_to_nid(pa) != i ) > + printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n", > + pa, phys_to_nid(pa), i); > + } > + > + j = cpumask_first(&cpu_online_map); > + n = 0; > + for_each_online_cpu ( i ) > + { > + if ( i != j + n || cpu_to_node[j] != cpu_to_node[i] ) > + { > + if ( n > 1 ) > + printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]); > + else > + printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]); > + j = i; > + n = 1; > + } > + else > + ++n; > + } > + if ( n > 1 ) > + printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]); > + else > + printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]); > + > + rcu_read_lock(&domlist_read_lock); > + > + printk("Memory location of each domain:\n"); > + for_each_domain ( d ) > + { > + process_pending_softirqs(); > + > + printk("Domain %u (total: %u):\n", d->domain_id, domain_tot_pages(d)); > + > + for_each_online_node ( i ) > + page_num_node[i] = 0; I'd be inclined to suggest to use memset() here, but I won't insist on you doing this "on the fly". Along with this would likely go the request to limit the scope of page_num_node[] (and then perhaps also vnuma and page). > + spin_lock(&d->page_alloc_lock); > + page_list_for_each ( page, &d->page_list ) > + { > + i = phys_to_nid(page_to_maddr(page)); > + page_num_node[i]++; > + } > + spin_unlock(&d->page_alloc_lock); > + > + for_each_online_node ( i ) > + printk(" Node %u: %u\n", i, page_num_node[i]); > + > + if ( !read_trylock(&d->vnuma_rwlock) ) > + continue; > + > + if ( !d->vnuma ) > + { > + read_unlock(&d->vnuma_rwlock); > + continue; > + } > + > + vnuma = d->vnuma; > + printk(" %u vnodes, %u vcpus, guest physical layout:\n", > + vnuma->nr_vnodes, d->max_vcpus); > + for ( i = 0; i < vnuma->nr_vnodes; i++ ) > + { > + unsigned int start_cpu = ~0U; > + > + if ( vnuma->vnode_to_pnode[i] == NUMA_NO_NODE ) > + printk(" %3u: pnode ???,", i); > + else > + printk(" %3u: pnode %3u,", i, vnuma->vnode_to_pnode[i]); > + > + printk(" vcpus "); > + > + for ( j = 0; j < d->max_vcpus; j++ ) > + { > + if ( !(j & 0x3f) ) > + process_pending_softirqs(); > + > + if ( vnuma->vcpu_to_vnode[j] == i ) > + { > + if ( start_cpu == ~0U ) > + { > + printk("%d", j); j being "unsigned int", would you mind switching to %u here and below? > --- a/xen/include/xen/numa.h > +++ b/xen/include/xen/numa.h > @@ -18,4 +18,71 @@ > (((d)->vcpu != NULL && (d)->vcpu[0] != NULL) \ > ? vcpu_to_node((d)->vcpu[0]) : NUMA_NO_NODE) > > +/* The following content can be used when NUMA feature is enabled */ > +#ifdef CONFIG_NUMA > + > +extern nodeid_t cpu_to_node[NR_CPUS]; > +extern cpumask_t node_to_cpumask[]; > + > +#define cpu_to_node(cpu) cpu_to_node[cpu] > +#define parent_node(node) (node) > +#define node_to_first_cpu(node) __ffs(node_to_cpumask[node]) I can't spot a use of this - perhaps better drop than generalize (if done right here then along with mentioning this in the description)? Jan
On 27.09.2022 10:19, Jan Beulich wrote: > On 20.09.2022 11:12, Wei Chen wrote: >> +static unsigned int __init extract_lsb_from_nodes(const struct node *nodes, >> + nodeid_t numnodes) >> +{ >> + unsigned int i, nodes_used = 0; >> + unsigned long spdx, epdx; >> + unsigned long bitfield = 0, memtop = 0; >> + >> + for ( i = 0; i < numnodes; i++ ) >> + { >> + spdx = paddr_to_pdx(nodes[i].start); >> + epdx = paddr_to_pdx(nodes[i].end - 1) + 1; >> + >> + if ( spdx >= epdx ) >> + continue; >> + >> + bitfield |= spdx; > > Perhaps to be taken care of in a separate patch: We accumulate only > the bits from the start addresses here, contrary to what the comment > ahead of the function says (and I think it is the comment which is > putting things correctly). It's the comment which is wrong - it wasn't updated in Linux commit 54413927f022 ("x86-64: x86_64-make-the-numa-hash-function-nodemap-allocation fix fix"). Our code was cloned from Linux'es. In fact when memory is not contiguous, too small a shift value might be determined. This in particular also may matter for the lowest range covered by any node: On x86 this will always start at zero (and hence won't influence the final calculation), but iirc on Arm memory may (and typically will) start at non-zero addresses. In such a case the first node's start address should be ignored, as it's fine to (kind of) associate all lower addresses (where no memory is) with this same node. But that's now indeed something which will want taking care of while making the code usable for other architectures. Jan
Hi Jan, On 2022/9/27 16:19, Jan Beulich wrote: > On 20.09.2022 11:12, Wei Chen wrote: >> There are some codes in x86/numa.c can be shared by common >> architectures to implememnt NUMA support. Just like some >> variables and functions to check and store NUMA memory map. >> And some variables and functions to do NUMA initialization. >> >> In this patch, we move them to common/numa.c and xen/numa.h >> and use the CONFIG_NUMA to gate them for non-NUMA supported >> architectures. As the target header file is Xen-style, so >> we trim some spaces and replace tabs for the codes that has >> been moved to xen/numa.h at the same time. >> >> As acpi_scan_nodes has been used in a common function, it >> doesn't make sense to use acpi_xxx in common code, so we >> rename it to numa_scan_nodes in this patch too. After that > > numa_process_nodes() now? Oh, yes, I will update it. > >> if we still use CONFIG_ACPI_NUMA in to gate numa_scan_nodes >> in numa_initmem_init, that doesn't make sense. As CONFIG_NUMA >> will be selected by CONFIG_ACPI_NUMA for x86. So, we replace >> CONFIG_ACPI_NUMA by CONFIG_NUMA to gate numa_scan_nodes. >> >> As arch_numa_disabled has been implememnted for ACPI NUMA, >> we can rename srat_disabled to numa_disabled and move it >> to common code as well. > > Please update the description: arch_numa_disabled() appears in patch 5 > only. Of course if you follow the comments to patch 2, the wording here > would be correct again. > I will update the description. >> +static unsigned int __init extract_lsb_from_nodes(const struct node *nodes, >> + nodeid_t numnodes) >> +{ >> + unsigned int i, nodes_used = 0; >> + unsigned long spdx, epdx; >> + unsigned long bitfield = 0, memtop = 0; >> + >> + for ( i = 0; i < numnodes; i++ ) >> + { >> + spdx = paddr_to_pdx(nodes[i].start); >> + epdx = paddr_to_pdx(nodes[i].end - 1) + 1; >> + >> + if ( spdx >= epdx ) >> + continue; >> + >> + bitfield |= spdx; > > Perhaps to be taken care of in a separate patch: We accumulate only > the bits from the start addresses here, contrary to what the comment > ahead of the function says (and I think it is the comment which is > putting things correctly). If one node has non-zero memory, the bit of end will >= the bit of start. As we use this function to calculate LSB, I don't think only accumulate bits of start addresses will be a problem. Instead I think we should modify this function comment to say why we only need to accumulate bits of start addresses. > >> + nodes_used++; >> + if ( epdx > memtop ) >> + memtop = epdx; >> + } >> + >> + if ( nodes_used <= 1 ) >> + i = BITS_PER_LONG - 1; > > Is this actually going to be correct for all architectures? Aiui > Arm64 has only up to 48 physical address bits, but what about an > architecture allowing the use of all 64 bits? I think at the very > least we want BUILD_BUG_ON(PADDR_BITS >= BITS_PER_LONG) here. > Ok I will add above BUILD_BUG_ON. And I also have question why can't we use PADDR_BITS here directly? >> + else >> + i = find_first_bit(&bitfield, sizeof(unsigned long) * 8); >> + >> + memnodemapsize = (memtop >> i) + 1; > > Again perhaps the subject of a separate patch: Isn't there an off-by-1 > mistake here? memtop is the maximum of all epdx-es, which are > calculated to be the first PDX following the region. Hence I'd expect > > memnodemapsize = ((memtop - 1) >> i) + 1; > > here. I guess I'll make patches for both issues, which you may then > need to re-base over. > Thanks, I will wait your patches. >> +static int __init cf_check numa_setup(const char *opt) >> +{ >> + if ( !strncmp(opt, "off", 3) ) >> + numa_off = true; >> + else if ( !strncmp(opt, "on", 2) ) >> + numa_off = false; >> +#ifdef CONFIG_NUMA_EMU >> + else if ( !strncmp(opt, "fake=", 5) ) >> + { >> + numa_off = false; >> + numa_fake = simple_strtoul(opt + 5, NULL, 0); >> + if ( numa_fake >= MAX_NUMNODES ) >> + numa_fake = MAX_NUMNODES; >> + } >> +#endif >> + else >> + return arch_numa_setup(opt); >> + >> + return 0; >> +} >> +custom_param("numa", numa_setup); > > Note that with this moved here at some point during your work (when > allowing NUMA=y for Arm) you'll need to update the command line doc. > I have prepared a patch for this doc in part#3 Arm part code. >> +static void cf_check dump_numa(unsigned char key) >> +{ >> + s_time_t now = NOW(); >> + unsigned int i, j, n; >> + struct domain *d; >> + const struct page_info *page; >> + unsigned int page_num_node[MAX_NUMNODES]; >> + const struct vnuma_info *vnuma; >> + >> + printk("'%c' pressed -> dumping numa info (now = %"PRI_stime")\n", key, >> + now); >> + >> + for_each_online_node ( i ) >> + { >> + paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1); >> + >> + printk("NODE%u start->%lu size->%lu free->%lu\n", >> + i, node_start_pfn(i), node_spanned_pages(i), >> + avail_node_heap_pages(i)); >> + /* Sanity check phys_to_nid() */ >> + if ( phys_to_nid(pa) != i ) >> + printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n", >> + pa, phys_to_nid(pa), i); >> + } >> + >> + j = cpumask_first(&cpu_online_map); >> + n = 0; >> + for_each_online_cpu ( i ) >> + { >> + if ( i != j + n || cpu_to_node[j] != cpu_to_node[i] ) >> + { >> + if ( n > 1 ) >> + printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]); >> + else >> + printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]); >> + j = i; >> + n = 1; >> + } >> + else >> + ++n; >> + } >> + if ( n > 1 ) >> + printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]); >> + else >> + printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]); >> + >> + rcu_read_lock(&domlist_read_lock); >> + >> + printk("Memory location of each domain:\n"); >> + for_each_domain ( d ) >> + { >> + process_pending_softirqs(); >> + >> + printk("Domain %u (total: %u):\n", d->domain_id, domain_tot_pages(d)); >> + >> + for_each_online_node ( i ) >> + page_num_node[i] = 0; > > I'd be inclined to suggest to use memset() here, but I won't insist > on you doing this "on the fly". Along with this would likely go the > request to limit the scope of page_num_node[] (and then perhaps also > vnuma and page). > memset for page_num_node makes sense, I will do it before for_each_domain ( d ). About limit the scope, did you mean, we should move: "const struct page_info *page; unsigned int page_num_node[MAX_NUMNODES]; const struct vnuma_info *vnuma;" to the block of for_each_domain ( d )? >> + spin_lock(&d->page_alloc_lock); >> + page_list_for_each ( page, &d->page_list ) >> + { >> + i = phys_to_nid(page_to_maddr(page)); >> + page_num_node[i]++; >> + } >> + spin_unlock(&d->page_alloc_lock); >> + >> + for_each_online_node ( i ) >> + printk(" Node %u: %u\n", i, page_num_node[i]); >> + >> + if ( !read_trylock(&d->vnuma_rwlock) ) >> + continue; >> + >> + if ( !d->vnuma ) >> + { >> + read_unlock(&d->vnuma_rwlock); >> + continue; >> + } >> + >> + vnuma = d->vnuma; >> + printk(" %u vnodes, %u vcpus, guest physical layout:\n", >> + vnuma->nr_vnodes, d->max_vcpus); >> + for ( i = 0; i < vnuma->nr_vnodes; i++ ) >> + { >> + unsigned int start_cpu = ~0U; >> + >> + if ( vnuma->vnode_to_pnode[i] == NUMA_NO_NODE ) >> + printk(" %3u: pnode ???,", i); >> + else >> + printk(" %3u: pnode %3u,", i, vnuma->vnode_to_pnode[i]); >> + >> + printk(" vcpus "); >> + >> + for ( j = 0; j < d->max_vcpus; j++ ) >> + { >> + if ( !(j & 0x3f) ) >> + process_pending_softirqs(); >> + >> + if ( vnuma->vcpu_to_vnode[j] == i ) >> + { >> + if ( start_cpu == ~0U ) >> + { >> + printk("%d", j); > > j being "unsigned int", would you mind switching to %u here and below? > Ok, I will do it and below. >> --- a/xen/include/xen/numa.h >> +++ b/xen/include/xen/numa.h >> @@ -18,4 +18,71 @@ >> (((d)->vcpu != NULL && (d)->vcpu[0] != NULL) \ >> ? vcpu_to_node((d)->vcpu[0]) : NUMA_NO_NODE) >> >> +/* The following content can be used when NUMA feature is enabled */ >> +#ifdef CONFIG_NUMA >> + >> +extern nodeid_t cpu_to_node[NR_CPUS]; >> +extern cpumask_t node_to_cpumask[]; >> + >> +#define cpu_to_node(cpu) cpu_to_node[cpu] >> +#define parent_node(node) (node) >> +#define node_to_first_cpu(node) __ffs(node_to_cpumask[node]) > > I can't spot a use of this - perhaps better drop than generalize (if > done right here then along with mentioning this in the description)? > Yes, there is no code using this macro anymore, I will delete it and mention it in the commit log. Cheers, Wei Chen > Jan
Hi Jan, On 2022/9/27 17:39, Jan Beulich wrote: > It's the comment which is wrong - it wasn't updated in Linux commit > 54413927f022 ("x86-64: x86_64-make-the-numa-hash-function-nodemap-allocation > fix fix"). Our code was cloned from Linux'es. In fact when memory is not > contiguous, too small a shift value might be determined. This in particular > also may matter for the lowest range covered by any node: On x86 this will > always start at zero (and hence won't influence the final calculation), but > iirc on Arm memory may (and typically will) start at non-zero addresses. In > such a case the first node's start address should be ignored, as it's fine > to (kind of) associate all lower addresses (where no memory is) with this > same node. But that's now indeed something which will want taking care of > while making the code usable for other architectures Sorry, I hadn't read this before my last reply. I think I kind of understand what you mean now. If we do not ignore the lowest memory start address in the node, then the shift we calculate may contain a part of non-ram physical addresses. But if we ignore it, which address should we start from? Counting from end may ignore most of the ram area on this node. Cheers, Wei Chen
On 29.09.2022 09:43, Wei Chen wrote: > On 2022/9/27 16:19, Jan Beulich wrote: >> On 20.09.2022 11:12, Wei Chen wrote: >>> + nodes_used++; >>> + if ( epdx > memtop ) >>> + memtop = epdx; >>> + } >>> + >>> + if ( nodes_used <= 1 ) >>> + i = BITS_PER_LONG - 1; >> >> Is this actually going to be correct for all architectures? Aiui >> Arm64 has only up to 48 physical address bits, but what about an >> architecture allowing the use of all 64 bits? I think at the very >> least we want BUILD_BUG_ON(PADDR_BITS >= BITS_PER_LONG) here. >> > > Ok I will add above BUILD_BUG_ON. And I also have question why can't > we use PADDR_BITS here directly? Well, if you used PADDR_BITS, then you would use it without subtracting 1, and you'd be in trouble again when PADDR_BITS == BITS_PER_LONG. What may be possible to do instead of BUILD_BUG_ON() is if ( nodes_used <= 1 ) i = min(PADDR_BITS, BITS_PER_LONG - 1); >>> + else >>> + i = find_first_bit(&bitfield, sizeof(unsigned long) * 8); >>> + >>> + memnodemapsize = (memtop >> i) + 1; >> >> Again perhaps the subject of a separate patch: Isn't there an off-by-1 >> mistake here? memtop is the maximum of all epdx-es, which are >> calculated to be the first PDX following the region. Hence I'd expect >> >> memnodemapsize = ((memtop - 1) >> i) + 1; >> >> here. I guess I'll make patches for both issues, which you may then >> need to re-base over. >> > > Thanks, I will wait your patches. Already sent out yesterday. >>> +static void cf_check dump_numa(unsigned char key) >>> +{ >>> + s_time_t now = NOW(); >>> + unsigned int i, j, n; >>> + struct domain *d; >>> + const struct page_info *page; >>> + unsigned int page_num_node[MAX_NUMNODES]; >>> + const struct vnuma_info *vnuma; >>> + >>> + printk("'%c' pressed -> dumping numa info (now = %"PRI_stime")\n", key, >>> + now); >>> + >>> + for_each_online_node ( i ) >>> + { >>> + paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1); >>> + >>> + printk("NODE%u start->%lu size->%lu free->%lu\n", >>> + i, node_start_pfn(i), node_spanned_pages(i), >>> + avail_node_heap_pages(i)); >>> + /* Sanity check phys_to_nid() */ >>> + if ( phys_to_nid(pa) != i ) >>> + printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n", >>> + pa, phys_to_nid(pa), i); >>> + } >>> + >>> + j = cpumask_first(&cpu_online_map); >>> + n = 0; >>> + for_each_online_cpu ( i ) >>> + { >>> + if ( i != j + n || cpu_to_node[j] != cpu_to_node[i] ) >>> + { >>> + if ( n > 1 ) >>> + printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]); >>> + else >>> + printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]); >>> + j = i; >>> + n = 1; >>> + } >>> + else >>> + ++n; >>> + } >>> + if ( n > 1 ) >>> + printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]); >>> + else >>> + printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]); >>> + >>> + rcu_read_lock(&domlist_read_lock); >>> + >>> + printk("Memory location of each domain:\n"); >>> + for_each_domain ( d ) >>> + { >>> + process_pending_softirqs(); >>> + >>> + printk("Domain %u (total: %u):\n", d->domain_id, domain_tot_pages(d)); >>> + >>> + for_each_online_node ( i ) >>> + page_num_node[i] = 0; >> >> I'd be inclined to suggest to use memset() here, but I won't insist >> on you doing this "on the fly". Along with this would likely go the >> request to limit the scope of page_num_node[] (and then perhaps also >> vnuma and page). >> > > memset for page_num_node makes sense, I will do it before > for_each_domain ( d ). That won't be right - array elements need clearing on every iteration. Plus ... > About limit the scope, did you mean, we should move: > > "const struct page_info *page; > unsigned int page_num_node[MAX_NUMNODES]; > const struct vnuma_info *vnuma;" > > to the block of for_each_domain ( d )? ... this limiting of scope (yes to your question) would also conflict with the movement you suggest. It is actually (among other things) such a mistaken movement which the more narrow scope is intended to prevent. Jan
Hi Jan, > -----Original Message----- > From: Jan Beulich <jbeulich@suse.com> > Sent: 2022年9月29日 20:14 > To: Wei Chen <Wei.Chen@arm.com> > Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau > Monné <roger.pau@citrix.com>; Wei Liu <wl@xen.org>; George Dunlap > <george.dunlap@citrix.com>; Julien Grall <julien@xen.org>; Stefano > Stabellini <sstabellini@kernel.org>; xen-devel@lists.xenproject.org > Subject: Re: [PATCH v5 2/6] xen/x86: move generically usable NUMA code > from x86 to common > > On 29.09.2022 09:43, Wei Chen wrote: > > On 2022/9/27 16:19, Jan Beulich wrote: > >> On 20.09.2022 11:12, Wei Chen wrote: > >>> + nodes_used++; > >>> + if ( epdx > memtop ) > >>> + memtop = epdx; > >>> + } > >>> + > >>> + if ( nodes_used <= 1 ) > >>> + i = BITS_PER_LONG - 1; > >> > >> Is this actually going to be correct for all architectures? Aiui > >> Arm64 has only up to 48 physical address bits, but what about an > >> architecture allowing the use of all 64 bits? I think at the very > >> least we want BUILD_BUG_ON(PADDR_BITS >= BITS_PER_LONG) here. > >> > > > > Ok I will add above BUILD_BUG_ON. And I also have question why can't > > we use PADDR_BITS here directly? > > Well, if you used PADDR_BITS, then you would use it without subtracting > 1, and you'd be in trouble again when PADDR_BITS == BITS_PER_LONG. What > may be possible to do instead of BUILD_BUG_ON() is > > if ( nodes_used <= 1 ) > i = min(PADDR_BITS, BITS_PER_LONG - 1); > This one seems better, I will follow it. > >>> + else > >>> + i = find_first_bit(&bitfield, sizeof(unsigned long) * 8); > >>> + > >>> + memnodemapsize = (memtop >> i) + 1; > >> > >> Again perhaps the subject of a separate patch: Isn't there an off-by-1 > >> mistake here? memtop is the maximum of all epdx-es, which are > >> calculated to be the first PDX following the region. Hence I'd expect > >> > >> memnodemapsize = ((memtop - 1) >> i) + 1; > >> > >> here. I guess I'll make patches for both issues, which you may then > >> need to re-base over. > >> > > > > Thanks, I will wait your patches. > > Already sent out yesterday. > Ok. > >>> +static void cf_check dump_numa(unsigned char key) > >>> +{ > >>> + s_time_t now = NOW(); > >>> + unsigned int i, j, n; > >>> + struct domain *d; > >>> + const struct page_info *page; > >>> + unsigned int page_num_node[MAX_NUMNODES]; > >>> + const struct vnuma_info *vnuma; > >>> + > >>> + printk("'%c' pressed -> dumping numa info (now = %"PRI_stime")\n", > key, > >>> + now); > >>> + > >>> + for_each_online_node ( i ) > >>> + { > >>> + paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1); > >>> + > >>> + printk("NODE%u start->%lu size->%lu free->%lu\n", > >>> + i, node_start_pfn(i), node_spanned_pages(i), > >>> + avail_node_heap_pages(i)); > >>> + /* Sanity check phys_to_nid() */ > >>> + if ( phys_to_nid(pa) != i ) > >>> + printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n", > >>> + pa, phys_to_nid(pa), i); > >>> + } > >>> + > >>> + j = cpumask_first(&cpu_online_map); > >>> + n = 0; > >>> + for_each_online_cpu ( i ) > >>> + { > >>> + if ( i != j + n || cpu_to_node[j] != cpu_to_node[i] ) > >>> + { > >>> + if ( n > 1 ) > >>> + printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, > cpu_to_node[j]); > >>> + else > >>> + printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]); > >>> + j = i; > >>> + n = 1; > >>> + } > >>> + else > >>> + ++n; > >>> + } > >>> + if ( n > 1 ) > >>> + printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, > cpu_to_node[j]); > >>> + else > >>> + printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]); > >>> + > >>> + rcu_read_lock(&domlist_read_lock); > >>> + > >>> + printk("Memory location of each domain:\n"); > >>> + for_each_domain ( d ) > >>> + { > >>> + process_pending_softirqs(); > >>> + > >>> + printk("Domain %u (total: %u):\n", d->domain_id, > domain_tot_pages(d)); > >>> + > >>> + for_each_online_node ( i ) > >>> + page_num_node[i] = 0; > >> > >> I'd be inclined to suggest to use memset() here, but I won't insist > >> on you doing this "on the fly". Along with this would likely go the > >> request to limit the scope of page_num_node[] (and then perhaps also > >> vnuma and page). > >> > > > > memset for page_num_node makes sense, I will do it before > > for_each_domain ( d ). > > That won't be right - array elements need clearing on every iteration. > Plus ... > Oh, Yes. > > About limit the scope, did you mean, we should move: > > > > "const struct page_info *page; > > unsigned int page_num_node[MAX_NUMNODES]; > > const struct vnuma_info *vnuma;" > > > > to the block of for_each_domain ( d )? > > ... this limiting of scope (yes to your question) would also conflict > with the movement you suggest. It is actually (among other things) > such a mistaken movement which the more narrow scope is intended to > prevent. > Thanks, Wei Chen > Jan
diff --git a/xen/arch/x86/include/asm/acpi.h b/xen/arch/x86/include/asm/acpi.h index 9a9cc4c240..5c2dd5da2d 100644 --- a/xen/arch/x86/include/asm/acpi.h +++ b/xen/arch/x86/include/asm/acpi.h @@ -102,7 +102,6 @@ extern unsigned long acpi_wakeup_address; #define ARCH_HAS_POWER_INIT 1 extern s8 acpi_numa; -extern int acpi_scan_nodes(u64 start, u64 end); #define NR_NODE_MEMBLKS (MAX_NUMNODES*2) extern struct acpi_sleep_info acpi_sinfo; diff --git a/xen/arch/x86/include/asm/numa.h b/xen/arch/x86/include/asm/numa.h index 529efadf93..6c87942d43 100644 --- a/xen/arch/x86/include/asm/numa.h +++ b/xen/arch/x86/include/asm/numa.h @@ -9,72 +9,17 @@ typedef u8 nodeid_t; extern int srat_rev; -extern nodeid_t cpu_to_node[NR_CPUS]; -extern cpumask_t node_to_cpumask[]; - -#define cpu_to_node(cpu) (cpu_to_node[cpu]) -#define parent_node(node) (node) -#define node_to_first_cpu(node) (__ffs(node_to_cpumask[node])) -#define node_to_cpumask(node) (node_to_cpumask[node]) - -struct node { - paddr_t start, end; -}; - -extern int compute_hash_shift(struct node *nodes, int numnodes, - nodeid_t *nodeids); extern nodeid_t pxm_to_node(unsigned int pxm); #define ZONE_ALIGN (1UL << (MAX_ORDER+PAGE_SHIFT)) -#define VIRTUAL_BUG_ON(x) -extern void numa_add_cpu(int cpu); -extern void numa_init_array(void); -extern bool numa_off; - -extern int arch_numa_setup(const char *opt); -extern bool arch_numa_broken(void); -extern bool srat_disabled(void); -extern void numa_set_node(int cpu, nodeid_t node); +extern bool numa_disabled(void); extern nodeid_t setup_node(unsigned int pxm); extern void srat_detect_node(int cpu); -extern void setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end); extern nodeid_t apicid_to_node[]; extern void init_cpu_to_node(void); -static inline void clear_node_cpumask(int cpu) -{ - cpumask_clear_cpu(cpu, &node_to_cpumask[cpu_to_node(cpu)]); -} - -/* Simple perfect hash to map pdx to node numbers */ -extern int memnode_shift; -extern unsigned long memnodemapsize; -extern u8 *memnodemap; - -struct node_data { - unsigned long node_start_pfn; - unsigned long node_spanned_pages; -}; - -extern struct node_data node_data[]; - -static inline __attribute__((pure)) nodeid_t phys_to_nid(paddr_t addr) -{ - nodeid_t nid; - VIRTUAL_BUG_ON((paddr_to_pdx(addr) >> memnode_shift) >= memnodemapsize); - nid = memnodemap[paddr_to_pdx(addr) >> memnode_shift]; - VIRTUAL_BUG_ON(nid >= MAX_NUMNODES || !node_data[nid]); - return nid; -} - -#define NODE_DATA(nid) (&(node_data[nid])) - -#define node_start_pfn(nid) (NODE_DATA(nid)->node_start_pfn) -#define node_spanned_pages(nid) (NODE_DATA(nid)->node_spanned_pages) -#define node_end_pfn(nid) (NODE_DATA(nid)->node_start_pfn + \ - NODE_DATA(nid)->node_spanned_pages) #define arch_want_default_dmazone() (num_online_nodes() > 1) extern int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node); diff --git a/xen/arch/x86/include/asm/setup.h b/xen/arch/x86/include/asm/setup.h index 21037b7f31..ae470ea12f 100644 --- a/xen/arch/x86/include/asm/setup.h +++ b/xen/arch/x86/include/asm/setup.h @@ -20,7 +20,6 @@ void early_time_init(void); void set_nr_cpu_ids(unsigned int max_cpus); -void numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn); void arch_init_memory(void); void subarch_init_memory(void); diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c index 1ab37b9c19..21efb1b1b3 100644 --- a/xen/arch/x86/numa.c +++ b/xen/arch/x86/numa.c @@ -4,20 +4,11 @@ * Adapted for Xen: Ryan Harper <ryanh@us.ibm.com> */ -#include <xen/mm.h> -#include <xen/string.h> #include <xen/init.h> -#include <xen/ctype.h> +#include <xen/mm.h> #include <xen/nodemask.h> #include <xen/numa.h> -#include <xen/keyhandler.h> -#include <xen/param.h> -#include <xen/time.h> -#include <xen/smp.h> -#include <xen/pfn.h> #include <asm/acpi.h> -#include <xen/sched.h> -#include <xen/softirq.h> #ifndef Dprintk #define Dprintk(x...) @@ -26,28 +17,13 @@ /* from proto.h */ #define round_up(x,y) ((((x)+(y))-1) & (~((y)-1))) -struct node_data node_data[MAX_NUMNODES]; - -/* Mapping from pdx to node id */ -int memnode_shift; -static typeof(*memnodemap) _memnodemap[64]; -unsigned long memnodemapsize; -u8 *memnodemap; - -nodeid_t cpu_to_node[NR_CPUS] __read_mostly = { - [0 ... NR_CPUS-1] = NUMA_NO_NODE -}; /* * Keep BIOS's CPU2node information, should not be used for memory allocaion */ nodeid_t apicid_to_node[MAX_LOCAL_APIC] = { [0 ... MAX_LOCAL_APIC-1] = NUMA_NO_NODE }; -cpumask_t node_to_cpumask[MAX_NUMNODES] __read_mostly; -nodemask_t __read_mostly node_online_map = { { [0] = 1UL } }; - -bool numa_off; s8 acpi_numa = 0; int __init arch_numa_setup(const char *opt) @@ -69,267 +45,6 @@ bool arch_numa_broken(void) return acpi_numa < 0; } -bool srat_disabled(void) -{ - return numa_off || arch_numa_broken(); -} - -/* - * Given a shift value, try to populate memnodemap[] - * Returns : - * 1 if OK - * 0 if memnodmap[] too small (of shift too small) - * -1 if node overlap or lost ram (shift too big) - */ -static int __init populate_memnodemap(const struct node *nodes, - int numnodes, int shift, nodeid_t *nodeids) -{ - unsigned long spdx, epdx; - int i, res = -1; - - memset(memnodemap, NUMA_NO_NODE, memnodemapsize * sizeof(*memnodemap)); - for ( i = 0; i < numnodes; i++ ) - { - spdx = paddr_to_pdx(nodes[i].start); - epdx = paddr_to_pdx(nodes[i].end - 1) + 1; - if ( spdx >= epdx ) - continue; - if ( (epdx >> shift) >= memnodemapsize ) - return 0; - do { - if ( memnodemap[spdx >> shift] != NUMA_NO_NODE ) - return -1; - - if ( !nodeids ) - memnodemap[spdx >> shift] = i; - else - memnodemap[spdx >> shift] = nodeids[i]; - - spdx += (1UL << shift); - } while ( spdx < epdx ); - res = 1; - } - - return res; -} - -static int __init allocate_cachealigned_memnodemap(void) -{ - unsigned long size = PFN_UP(memnodemapsize * sizeof(*memnodemap)); - unsigned long mfn = mfn_x(alloc_boot_pages(size, 1)); - - memnodemap = mfn_to_virt(mfn); - mfn <<= PAGE_SHIFT; - size <<= PAGE_SHIFT; - printk(KERN_DEBUG "NUMA: Allocated memnodemap from %lx - %lx\n", - mfn, mfn + size); - memnodemapsize = size / sizeof(*memnodemap); - - return 0; -} - -/* - * The LSB of all start and end addresses in the node map is the value of the - * maximum possible shift. - */ -static int __init extract_lsb_from_nodes(const struct node *nodes, - int numnodes) -{ - int i, nodes_used = 0; - unsigned long spdx, epdx; - unsigned long bitfield = 0, memtop = 0; - - for ( i = 0; i < numnodes; i++ ) - { - spdx = paddr_to_pdx(nodes[i].start); - epdx = paddr_to_pdx(nodes[i].end - 1) + 1; - if ( spdx >= epdx ) - continue; - bitfield |= spdx; - nodes_used++; - if ( epdx > memtop ) - memtop = epdx; - } - if ( nodes_used <= 1 ) - i = BITS_PER_LONG - 1; - else - i = find_first_bit(&bitfield, sizeof(unsigned long)*8); - memnodemapsize = (memtop >> i) + 1; - return i; -} - -int __init compute_hash_shift(struct node *nodes, int numnodes, - nodeid_t *nodeids) -{ - int shift; - - shift = extract_lsb_from_nodes(nodes, numnodes); - if ( memnodemapsize <= ARRAY_SIZE(_memnodemap) ) - memnodemap = _memnodemap; - else if ( allocate_cachealigned_memnodemap() ) - return -1; - printk(KERN_DEBUG "NUMA: Using %d for the hash shift.\n", shift); - - if ( populate_memnodemap(nodes, numnodes, shift, nodeids) != 1 ) - { - printk(KERN_INFO "Your memory is not aligned you need to " - "rebuild your hypervisor with a bigger NODEMAPSIZE " - "shift=%d\n", shift); - return -1; - } - - return shift; -} -/* initialize NODE_DATA given nodeid and start/end */ -void __init setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end) -{ - unsigned long start_pfn = paddr_to_pfn(start); - unsigned long end_pfn = paddr_to_pfn(end); - - NODE_DATA(nodeid)->node_start_pfn = start_pfn; - NODE_DATA(nodeid)->node_spanned_pages = end_pfn - start_pfn; - - node_set_online(nodeid); -} - -void __init numa_init_array(void) -{ - int rr, i; - - /* There are unfortunately some poorly designed mainboards around - that only connect memory to a single CPU. This breaks the 1:1 cpu->node - mapping. To avoid this fill in the mapping for all possible - CPUs, as the number of CPUs is not known yet. - We round robin the existing nodes. */ - rr = first_node(node_online_map); - for ( i = 0; i < nr_cpu_ids; i++ ) - { - if ( cpu_to_node[i] != NUMA_NO_NODE ) - continue; - numa_set_node(i, rr); - rr = cycle_node(rr, node_online_map); - } -} - -#ifdef CONFIG_NUMA_EMU -static int numa_fake __initdata = 0; - -/* Numa emulation */ -static int __init numa_emulation(unsigned long start_pfn, - unsigned long end_pfn) -{ - int i; - struct node nodes[MAX_NUMNODES]; - uint64_t sz = pfn_to_paddr(end_pfn - start_pfn) / numa_fake; - - /* Kludge needed for the hash function */ - if ( hweight64(sz) > 1 ) - { - u64 x = 1; - while ( (x << 1) < sz ) - x <<= 1; - if ( x < sz/2 ) - printk(KERN_ERR "Numa emulation unbalanced. Complain to maintainer\n"); - sz = x; - } - - memset(&nodes,0,sizeof(nodes)); - for ( i = 0; i < numa_fake; i++ ) - { - nodes[i].start = pfn_to_paddr(start_pfn) + i * sz; - if ( i == numa_fake - 1 ) - sz = pfn_to_paddr(end_pfn) - nodes[i].start; - nodes[i].end = nodes[i].start + sz; - printk(KERN_INFO "Faking node %d at %"PRIx64"-%"PRIx64" (%"PRIu64"MB)\n", - i, - nodes[i].start, nodes[i].end, - (nodes[i].end - nodes[i].start) >> 20); - node_set_online(i); - } - memnode_shift = compute_hash_shift(nodes, numa_fake, NULL); - if ( memnode_shift < 0 ) - { - memnode_shift = 0; - printk(KERN_ERR "No NUMA hash function found. Emulation disabled.\n"); - return -1; - } - for_each_online_node ( i ) - setup_node_bootmem(i, nodes[i].start, nodes[i].end); - numa_init_array(); - - return 0; -} -#endif - -void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn) -{ - int i; - paddr_t start = pfn_to_paddr(start_pfn); - paddr_t end = pfn_to_paddr(end_pfn); - -#ifdef CONFIG_NUMA_EMU - if ( numa_fake && !numa_emulation(start_pfn, end_pfn) ) - return; -#endif - -#ifdef CONFIG_ACPI_NUMA - if ( !numa_off && !acpi_scan_nodes(start, end) ) - return; -#endif - - printk(KERN_INFO "%s\n", - numa_off ? "NUMA turned off" : "No NUMA configuration found"); - - printk(KERN_INFO "Faking a node at %"PRIpaddr"-%"PRIpaddr"\n", - start, end); - /* setup dummy node covering all memory */ - memnode_shift = BITS_PER_LONG - 1; - memnodemap = _memnodemap; - /* Dummy node only uses 1 slot in reality */ - memnodemap[0] = 0; - memnodemapsize = 1; - - nodes_clear(node_online_map); - node_set_online(0); - for ( i = 0; i < nr_cpu_ids; i++ ) - numa_set_node(i, 0); - cpumask_copy(&node_to_cpumask[0], cpumask_of(0)); - setup_node_bootmem(0, start, end); -} - -void numa_add_cpu(int cpu) -{ - cpumask_set_cpu(cpu, &node_to_cpumask[cpu_to_node(cpu)]); -} - -void numa_set_node(int cpu, nodeid_t node) -{ - cpu_to_node[cpu] = node; -} - -/* [numa=off] */ -static int __init cf_check numa_setup(const char *opt) -{ - if ( !strncmp(opt, "off", 3) ) - numa_off = true; - else if ( !strncmp(opt, "on", 2) ) - numa_off = false; -#ifdef CONFIG_NUMA_EMU - else if ( !strncmp(opt, "fake=", 5) ) - { - numa_off = false; - numa_fake = simple_strtoul(opt + 5, NULL, 0); - if ( numa_fake >= MAX_NUMNODES ) - numa_fake = MAX_NUMNODES; - } -#endif - else - return arch_numa_setup(opt); - - return 0; -} -custom_param("numa", numa_setup); - /* * Setup early cpu_to_node. * @@ -378,146 +93,3 @@ unsigned int __init arch_get_dma_bitsize(void) flsl(node_start_pfn(node) + node_spanned_pages(node) / 4 - 1) + PAGE_SHIFT, 32); } - -static void cf_check dump_numa(unsigned char key) -{ - s_time_t now = NOW(); - unsigned int i, j, n; - struct domain *d; - struct page_info *page; - unsigned int page_num_node[MAX_NUMNODES]; - const struct vnuma_info *vnuma; - - printk("'%c' pressed -> dumping numa info (now = %"PRI_stime")\n", key, - now); - - for_each_online_node ( i ) - { - paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1); - - printk("NODE%u start->%lu size->%lu free->%lu\n", - i, node_start_pfn(i), node_spanned_pages(i), - avail_node_heap_pages(i)); - /* sanity check phys_to_nid() */ - if ( phys_to_nid(pa) != i ) - printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n", - pa, phys_to_nid(pa), i); - } - - j = cpumask_first(&cpu_online_map); - n = 0; - for_each_online_cpu ( i ) - { - if ( i != j + n || cpu_to_node[j] != cpu_to_node[i] ) - { - if ( n > 1 ) - printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]); - else - printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]); - j = i; - n = 1; - } - else - ++n; - } - if ( n > 1 ) - printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]); - else - printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]); - - rcu_read_lock(&domlist_read_lock); - - printk("Memory location of each domain:\n"); - for_each_domain ( d ) - { - process_pending_softirqs(); - - printk("Domain %u (total: %u):\n", d->domain_id, domain_tot_pages(d)); - - for_each_online_node ( i ) - page_num_node[i] = 0; - - spin_lock(&d->page_alloc_lock); - page_list_for_each(page, &d->page_list) - { - i = phys_to_nid(page_to_maddr(page)); - page_num_node[i]++; - } - spin_unlock(&d->page_alloc_lock); - - for_each_online_node ( i ) - printk(" Node %u: %u\n", i, page_num_node[i]); - - if ( !read_trylock(&d->vnuma_rwlock) ) - continue; - - if ( !d->vnuma ) - { - read_unlock(&d->vnuma_rwlock); - continue; - } - - vnuma = d->vnuma; - printk(" %u vnodes, %u vcpus, guest physical layout:\n", - vnuma->nr_vnodes, d->max_vcpus); - for ( i = 0; i < vnuma->nr_vnodes; i++ ) - { - unsigned int start_cpu = ~0U; - - if ( vnuma->vnode_to_pnode[i] == NUMA_NO_NODE ) - printk(" %3u: pnode ???,", i); - else - printk(" %3u: pnode %3u,", i, vnuma->vnode_to_pnode[i]); - - printk(" vcpus "); - - for ( j = 0; j < d->max_vcpus; j++ ) - { - if ( !(j & 0x3f) ) - process_pending_softirqs(); - - if ( vnuma->vcpu_to_vnode[j] == i ) - { - if ( start_cpu == ~0U ) - { - printk("%d", j); - start_cpu = j; - } - } - else if ( start_cpu != ~0U ) - { - if ( j - 1 != start_cpu ) - printk("-%d ", j - 1); - else - printk(" "); - start_cpu = ~0U; - } - } - - if ( start_cpu != ~0U && start_cpu != j - 1 ) - printk("-%d", j - 1); - - printk("\n"); - - for ( j = 0; j < vnuma->nr_vmemranges; j++ ) - { - if ( vnuma->vmemrange[j].nid == i ) - printk(" %016"PRIx64" - %016"PRIx64"\n", - vnuma->vmemrange[j].start, - vnuma->vmemrange[j].end); - } - } - - read_unlock(&d->vnuma_rwlock); - } - - rcu_read_unlock(&domlist_read_lock); -} - -static int __init cf_check register_numa_trigger(void) -{ - register_keyhandler('u', dump_numa, "dump NUMA info", 1); - return 0; -} -__initcall(register_numa_trigger); - diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c index b46fd9ab18..9df08e9366 100644 --- a/xen/arch/x86/smpboot.c +++ b/xen/arch/x86/smpboot.c @@ -1350,7 +1350,7 @@ int cpu_add(uint32_t apic_id, uint32_t acpi_id, uint32_t pxm) x86_acpiid_to_apicid[acpi_id] = apic_id; - if ( !srat_disabled() ) + if ( !numa_disabled() ) { nodeid_t node = setup_node(pxm); diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c index b62a152911..0d4f7cccb9 100644 --- a/xen/arch/x86/srat.c +++ b/xen/arch/x86/srat.c @@ -238,7 +238,7 @@ acpi_numa_x2apic_affinity_init(const struct acpi_srat_x2apic_cpu_affinity *pa) unsigned pxm; nodeid_t node; - if (srat_disabled()) + if (numa_disabled()) return; if (pa->header.length < sizeof(struct acpi_srat_x2apic_cpu_affinity)) { bad_srat(); @@ -274,7 +274,7 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa) unsigned pxm; nodeid_t node; - if (srat_disabled()) + if (numa_disabled()) return; if (pa->header.length != sizeof(struct acpi_srat_cpu_affinity)) { bad_srat(); @@ -313,7 +313,7 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma) nodeid_t node; unsigned int i; - if (srat_disabled()) + if (numa_disabled()) return; if (ma->header.length != sizeof(struct acpi_srat_mem_affinity)) { bad_srat(); @@ -519,8 +519,8 @@ void __init srat_parse_regions(paddr_t addr) pfn_pdx_hole_setup(mask >> PAGE_SHIFT); } -/* Use the information discovered above to actually set up the nodes. */ -int __init acpi_scan_nodes(paddr_t start, paddr_t end) +/* Use discovered information to actually set up the nodes. */ +int __init numa_process_nodes(paddr_t start, paddr_t end) { int i; nodemask_t all_nodes_parsed; diff --git a/xen/common/Makefile b/xen/common/Makefile index 3baf83d527..9a3a12b12d 100644 --- a/xen/common/Makefile +++ b/xen/common/Makefile @@ -26,6 +26,7 @@ obj-$(CONFIG_MEM_ACCESS) += mem_access.o obj-y += memory.o obj-y += multicall.o obj-y += notifier.o +obj-$(CONFIG_NUMA) += numa.o obj-y += page_alloc.o obj-$(CONFIG_HAS_PDX) += pdx.o obj-$(CONFIG_PERF_COUNTERS) += perfc.o diff --git a/xen/common/numa.c b/xen/common/numa.c new file mode 100644 index 0000000000..83f4c8cc94 --- /dev/null +++ b/xen/common/numa.c @@ -0,0 +1,460 @@ +/* + * Generic VM initialization for NUMA setups. + * Copyright 2002,2003 Andi Kleen, SuSE Labs. + * Adapted for Xen: Ryan Harper <ryanh@us.ibm.com> + */ + +#include <xen/init.h> +#include <xen/keyhandler.h> +#include <xen/mm.h> +#include <xen/nodemask.h> +#include <xen/numa.h> +#include <xen/param.h> +#include <xen/sched.h> +#include <xen/softirq.h> + +struct node_data __ro_after_init node_data[MAX_NUMNODES]; + +/* Mapping from pdx to node id */ +unsigned int __ro_after_init memnode_shift; +unsigned long __ro_after_init memnodemapsize; +nodeid_t *__ro_after_init memnodemap; +static typeof(*memnodemap) __ro_after_init _memnodemap[64]; + +nodeid_t __read_mostly cpu_to_node[NR_CPUS] = { + [0 ... NR_CPUS-1] = NUMA_NO_NODE +}; + +cpumask_t __read_mostly node_to_cpumask[MAX_NUMNODES]; + +nodemask_t __read_mostly node_online_map = { { [0] = 1UL } }; + +bool __ro_after_init numa_off; + +bool numa_disabled(void) +{ + return numa_off || arch_numa_broken(); +} + +/* + * Given a shift value, try to populate memnodemap[] + * Returns : + * 1 if OK + * 0 if memnodmap[] too small (of shift too small) + * -1 if node overlap or lost ram (shift too big) + */ +static int __init populate_memnodemap(const struct node *nodes, + unsigned int numnodes, unsigned int shift, + const nodeid_t *nodeids) +{ + unsigned long spdx, epdx; + unsigned int i; + int res = -1; + + memset(memnodemap, NUMA_NO_NODE, memnodemapsize * sizeof(*memnodemap)); + + for ( i = 0; i < numnodes; i++ ) + { + spdx = paddr_to_pdx(nodes[i].start); + epdx = paddr_to_pdx(nodes[i].end - 1) + 1; + + if ( spdx >= epdx ) + continue; + + if ( (epdx >> shift) >= memnodemapsize ) + return 0; + + do { + if ( memnodemap[spdx >> shift] != NUMA_NO_NODE ) + return -1; + + if ( !nodeids ) + memnodemap[spdx >> shift] = i; + else + memnodemap[spdx >> shift] = nodeids[i]; + + spdx += (1UL << shift); + } while ( spdx < epdx ); + + res = 1; + } + + return res; +} + +static int __init allocate_cachealigned_memnodemap(void) +{ + unsigned long size = PFN_UP(memnodemapsize * sizeof(*memnodemap)); + unsigned long mfn = mfn_x(alloc_boot_pages(size, 1)); + + memnodemap = mfn_to_virt(mfn); + mfn <<= PAGE_SHIFT; + size <<= PAGE_SHIFT; + printk(KERN_DEBUG "NUMA: Allocated memnodemap from %lx - %lx\n", + mfn, mfn + size); + memnodemapsize = size / sizeof(*memnodemap); + + return 0; +} + +/* + * The LSB of all start and end addresses in the node map is the value of the + * maximum possible shift. + */ +static unsigned int __init extract_lsb_from_nodes(const struct node *nodes, + nodeid_t numnodes) +{ + unsigned int i, nodes_used = 0; + unsigned long spdx, epdx; + unsigned long bitfield = 0, memtop = 0; + + for ( i = 0; i < numnodes; i++ ) + { + spdx = paddr_to_pdx(nodes[i].start); + epdx = paddr_to_pdx(nodes[i].end - 1) + 1; + + if ( spdx >= epdx ) + continue; + + bitfield |= spdx; + nodes_used++; + if ( epdx > memtop ) + memtop = epdx; + } + + if ( nodes_used <= 1 ) + i = BITS_PER_LONG - 1; + else + i = find_first_bit(&bitfield, sizeof(unsigned long) * 8); + + memnodemapsize = (memtop >> i) + 1; + + return i; +} + +int __init compute_hash_shift(const struct node *nodes, + unsigned int numnodes, const nodeid_t *nodeids) +{ + unsigned int shift = extract_lsb_from_nodes(nodes, numnodes); + + if ( memnodemapsize <= ARRAY_SIZE(_memnodemap) ) + memnodemap = _memnodemap; + else if ( allocate_cachealigned_memnodemap() ) + return -1; + + printk(KERN_DEBUG "NUMA: Using %u for the hash shift\n", shift); + + if ( populate_memnodemap(nodes, numnodes, shift, nodeids) != 1 ) + { + printk(KERN_INFO "Your memory is not aligned you need to " + "rebuild your hypervisor with a bigger NODEMAPSIZE " + "shift=%u\n", shift); + return -1; + } + + return shift; +} + +/* Initialize NODE_DATA given nodeid and start/end */ +void __init setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end) +{ + unsigned long start_pfn = paddr_to_pfn(start); + unsigned long end_pfn = paddr_to_pfn(end); + + NODE_DATA(nodeid)->node_start_pfn = start_pfn; + NODE_DATA(nodeid)->node_spanned_pages = end_pfn - start_pfn; + + node_set_online(nodeid); +} + +void __init numa_init_array(void) +{ + unsigned int i; + nodeid_t rr; + + /* + * There are unfortunately some poorly designed mainboards + * around that only connect memory to a single CPU. This + * breaks the 1:1 cpu->node mapping. To avoid this fill in + * the mapping for all possible CPUs, as the number of CPUs + * is not known yet. We round robin the existing nodes. + */ + rr = first_node(node_online_map); + for ( i = 0; i < nr_cpu_ids; i++ ) + { + if ( cpu_to_node[i] != NUMA_NO_NODE ) + continue; + numa_set_node(i, rr); + rr = cycle_node(rr, node_online_map); + } +} + +#ifdef CONFIG_NUMA_EMU +static unsigned int __initdata numa_fake; + +/* Numa emulation */ +static int __init numa_emulation(unsigned long start_pfn, + unsigned long end_pfn) +{ + int ret; + unsigned int i; + struct node nodes[MAX_NUMNODES]; + uint64_t sz = pfn_to_paddr(end_pfn - start_pfn) / numa_fake; + + /* Kludge needed for the hash function */ + if ( hweight64(sz) > 1 ) + { + uint64_t x = 1; + + while ( (x << 1) < sz ) + x <<= 1; + if ( x < sz / 2 ) + printk(KERN_ERR "Numa emulation unbalanced. Complain to maintainer\n"); + sz = x; + } + + memset(&nodes, 0, sizeof(nodes)); + for ( i = 0; i < numa_fake; i++ ) + { + nodes[i].start = pfn_to_paddr(start_pfn) + i * sz; + + if ( i == numa_fake - 1 ) + sz = pfn_to_paddr(end_pfn) - nodes[i].start; + + nodes[i].end = nodes[i].start + sz; + printk(KERN_INFO "Faking node %u at %"PRIx64"-%"PRIx64" (%"PRIu64"MB)\n", + i, nodes[i].start, nodes[i].end, + (nodes[i].end - nodes[i].start) >> 20); + node_set_online(i); + } + + ret = compute_hash_shift(nodes, numa_fake, NULL); + if ( ret < 0 ) + { + printk(KERN_ERR "No NUMA hash function found. Emulation disabled.\n"); + return -1; + } + memnode_shift = ret; + + for_each_online_node ( i ) + setup_node_bootmem(i, nodes[i].start, nodes[i].end); + + numa_init_array(); + + return 0; +} +#endif + +void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn) +{ + unsigned int i; + paddr_t start = pfn_to_paddr(start_pfn); + paddr_t end = pfn_to_paddr(end_pfn); + +#ifdef CONFIG_NUMA_EMU + if ( numa_fake && !numa_emulation(start_pfn, end_pfn) ) + return; +#endif + +#ifdef CONFIG_NUMA + if ( !numa_off && !numa_process_nodes(start, end) ) + return; +#endif + + printk(KERN_INFO "%s\n", + numa_off ? "NUMA turned off" : "No NUMA configuration found"); + + printk(KERN_INFO "Faking a node at %"PRIpaddr"-%"PRIpaddr"\n", + start, end); + + /* Setup dummy node covering all memory */ + memnode_shift = BITS_PER_LONG - 1; + memnodemap = _memnodemap; + + /* Dummy node only uses 1 slot in reality */ + memnodemap[0] = 0; + memnodemapsize = 1; + + nodes_clear(node_online_map); + node_set_online(0); + for ( i = 0; i < nr_cpu_ids; i++ ) + numa_set_node(i, 0); + + cpumask_copy(&node_to_cpumask[0], cpumask_of(0)); + setup_node_bootmem(0, start, end); +} + +void numa_add_cpu(unsigned int cpu) +{ + cpumask_set_cpu(cpu, &node_to_cpumask[cpu_to_node(cpu)]); +} + +void numa_set_node(unsigned int cpu, nodeid_t node) +{ + cpu_to_node[cpu] = node; +} + +/* [numa=off] */ +static int __init cf_check numa_setup(const char *opt) +{ + if ( !strncmp(opt, "off", 3) ) + numa_off = true; + else if ( !strncmp(opt, "on", 2) ) + numa_off = false; +#ifdef CONFIG_NUMA_EMU + else if ( !strncmp(opt, "fake=", 5) ) + { + numa_off = false; + numa_fake = simple_strtoul(opt + 5, NULL, 0); + if ( numa_fake >= MAX_NUMNODES ) + numa_fake = MAX_NUMNODES; + } +#endif + else + return arch_numa_setup(opt); + + return 0; +} +custom_param("numa", numa_setup); + +static void cf_check dump_numa(unsigned char key) +{ + s_time_t now = NOW(); + unsigned int i, j, n; + struct domain *d; + const struct page_info *page; + unsigned int page_num_node[MAX_NUMNODES]; + const struct vnuma_info *vnuma; + + printk("'%c' pressed -> dumping numa info (now = %"PRI_stime")\n", key, + now); + + for_each_online_node ( i ) + { + paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1); + + printk("NODE%u start->%lu size->%lu free->%lu\n", + i, node_start_pfn(i), node_spanned_pages(i), + avail_node_heap_pages(i)); + /* Sanity check phys_to_nid() */ + if ( phys_to_nid(pa) != i ) + printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n", + pa, phys_to_nid(pa), i); + } + + j = cpumask_first(&cpu_online_map); + n = 0; + for_each_online_cpu ( i ) + { + if ( i != j + n || cpu_to_node[j] != cpu_to_node[i] ) + { + if ( n > 1 ) + printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]); + else + printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]); + j = i; + n = 1; + } + else + ++n; + } + if ( n > 1 ) + printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]); + else + printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]); + + rcu_read_lock(&domlist_read_lock); + + printk("Memory location of each domain:\n"); + for_each_domain ( d ) + { + process_pending_softirqs(); + + printk("Domain %u (total: %u):\n", d->domain_id, domain_tot_pages(d)); + + for_each_online_node ( i ) + page_num_node[i] = 0; + + spin_lock(&d->page_alloc_lock); + page_list_for_each ( page, &d->page_list ) + { + i = phys_to_nid(page_to_maddr(page)); + page_num_node[i]++; + } + spin_unlock(&d->page_alloc_lock); + + for_each_online_node ( i ) + printk(" Node %u: %u\n", i, page_num_node[i]); + + if ( !read_trylock(&d->vnuma_rwlock) ) + continue; + + if ( !d->vnuma ) + { + read_unlock(&d->vnuma_rwlock); + continue; + } + + vnuma = d->vnuma; + printk(" %u vnodes, %u vcpus, guest physical layout:\n", + vnuma->nr_vnodes, d->max_vcpus); + for ( i = 0; i < vnuma->nr_vnodes; i++ ) + { + unsigned int start_cpu = ~0U; + + if ( vnuma->vnode_to_pnode[i] == NUMA_NO_NODE ) + printk(" %3u: pnode ???,", i); + else + printk(" %3u: pnode %3u,", i, vnuma->vnode_to_pnode[i]); + + printk(" vcpus "); + + for ( j = 0; j < d->max_vcpus; j++ ) + { + if ( !(j & 0x3f) ) + process_pending_softirqs(); + + if ( vnuma->vcpu_to_vnode[j] == i ) + { + if ( start_cpu == ~0U ) + { + printk("%d", j); + start_cpu = j; + } + } + else if ( start_cpu != ~0U ) + { + if ( j - 1 != start_cpu ) + printk("-%d ", j - 1); + else + printk(" "); + start_cpu = ~0U; + } + } + + if ( start_cpu != ~0U && start_cpu != j - 1 ) + printk("-%d", j - 1); + + printk("\n"); + + for ( j = 0; j < vnuma->nr_vmemranges; j++ ) + { + if ( vnuma->vmemrange[j].nid == i ) + printk(" %016"PRIx64" - %016"PRIx64"\n", + vnuma->vmemrange[j].start, + vnuma->vmemrange[j].end); + } + } + + read_unlock(&d->vnuma_rwlock); + } + + rcu_read_unlock(&domlist_read_lock); +} + +static int __init cf_check register_numa_trigger(void) +{ + register_keyhandler('u', dump_numa, "dump NUMA info", 1); + return 0; +} +__initcall(register_numa_trigger); diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h index 7aef1a88dc..d799078a7a 100644 --- a/xen/include/xen/numa.h +++ b/xen/include/xen/numa.h @@ -18,4 +18,71 @@ (((d)->vcpu != NULL && (d)->vcpu[0] != NULL) \ ? vcpu_to_node((d)->vcpu[0]) : NUMA_NO_NODE) +/* The following content can be used when NUMA feature is enabled */ +#ifdef CONFIG_NUMA + +extern nodeid_t cpu_to_node[NR_CPUS]; +extern cpumask_t node_to_cpumask[]; + +#define cpu_to_node(cpu) cpu_to_node[cpu] +#define parent_node(node) (node) +#define node_to_first_cpu(node) __ffs(node_to_cpumask[node]) +#define node_to_cpumask(node) node_to_cpumask[node] + +struct node { + paddr_t start, end; +}; + +extern int compute_hash_shift(const struct node *nodes, + unsigned int numnodes, const nodeid_t *nodeids); + +#define VIRTUAL_BUG_ON(x) + +extern bool numa_off; + +extern void numa_add_cpu(unsigned int cpu); +extern void numa_init_array(void); +extern void numa_set_node(unsigned int cpu, nodeid_t node); +extern void numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn); +extern int numa_process_nodes(paddr_t start, paddr_t end); + +extern int arch_numa_setup(const char *opt); +extern bool arch_numa_broken(void); +extern void setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end); + +static inline void clear_node_cpumask(unsigned int cpu) +{ + cpumask_clear_cpu(cpu, &node_to_cpumask[cpu_to_node(cpu)]); +} + +/* Simple perfect hash to map pdx to node numbers */ +extern unsigned int memnode_shift; +extern unsigned long memnodemapsize; +extern uint8_t *memnodemap; + +struct node_data { + unsigned long node_start_pfn; + unsigned long node_spanned_pages; +}; + +extern struct node_data node_data[]; + +static inline nodeid_t __attribute_pure__ phys_to_nid(paddr_t addr) +{ + nodeid_t nid; + VIRTUAL_BUG_ON((paddr_to_pdx(addr) >> memnode_shift) >= memnodemapsize); + nid = memnodemap[paddr_to_pdx(addr) >> memnode_shift]; + VIRTUAL_BUG_ON(nid >= MAX_NUMNODES || !node_data[nid]); + return nid; +} + +#define NODE_DATA(nid) (&node_data[nid]) + +#define node_start_pfn(nid) (NODE_DATA(nid)->node_start_pfn) +#define node_spanned_pages(nid) (NODE_DATA(nid)->node_spanned_pages) +#define node_end_pfn(nid) (NODE_DATA(nid)->node_start_pfn + \ + NODE_DATA(nid)->node_spanned_pages) + +#endif + #endif /* _XEN_NUMA_H */
There are some codes in x86/numa.c can be shared by common architectures to implememnt NUMA support. Just like some variables and functions to check and store NUMA memory map. And some variables and functions to do NUMA initialization. In this patch, we move them to common/numa.c and xen/numa.h and use the CONFIG_NUMA to gate them for non-NUMA supported architectures. As the target header file is Xen-style, so we trim some spaces and replace tabs for the codes that has been moved to xen/numa.h at the same time. As acpi_scan_nodes has been used in a common function, it doesn't make sense to use acpi_xxx in common code, so we rename it to numa_scan_nodes in this patch too. After that if we still use CONFIG_ACPI_NUMA in to gate numa_scan_nodes in numa_initmem_init, that doesn't make sense. As CONFIG_NUMA will be selected by CONFIG_ACPI_NUMA for x86. So, we replace CONFIG_ACPI_NUMA by CONFIG_NUMA to gate numa_scan_nodes. As arch_numa_disabled has been implememnted for ACPI NUMA, we can rename srat_disabled to numa_disabled and move it to common code as well. Signed-off-by: Wei Chen <wei.chen@arm.com> --- v4 -> v5: 1. Use nodeid_t instead of uint8_t for memnodemap. 2. Restore to use typeof(*memnodemap) for _memnodemap, this will avoid the further adjustments for _memnodemap's type. 3. Use __ro_after_init for numa_off. 4. Use pointer-to-const for proper function parameters. 5. Use unsigned int for variables that are not realy used for node ID. 6. Fix code comments code-style and adjust the length. 7. Fix code-styles. 8. Rename numa_scan_nodes to numa_process_nodes. 9. Use a plain "int ret" to record compute_hash_shift return value. v3 -> v4: 1. Restore compute_hash_shift's return value to int. 2. Remove unnecessary parentheses for macros. 3. Use unsigned int for proper variables. 4. Fix some code-style. v2 -> v3: 1. Remove acpi.h from common/numa.c. 2. Rename acpi_scan_nodes to numa_scan_nodes. 3. Replace u8 by uint8_t for memnodemap. 4. Use unsigned int for memnode_shift and adjust related functions (compute_hash_shift, populate_memnodemap) to use correct types for return values or parameters. 5. Use nodeid_t for nodeid and node numbers. 6. Use __read_mostly and __ro_after_init for appropriate variables. 7. Adjust the __read_mostly and __initdata location for some variables. 8. convert from plain int to unsigned for cpuid and other proper variables. 9. Use __attribute_pure__ instead of __attribute__((pure)). 10. Replace CONFIG_ACPI_NUMA by CONFIG_NUMA in numa_initmem_init. 11. Add const for some functions' parameters. 12. Move srat_disabled to common code with new name numa_disabled. 13. Fix some spaces code-style for numa_emulation. 14. Change from int to unsigned int for numa_fake. v1 -> v2: 1. New patch in v2. --- xen/arch/x86/include/asm/acpi.h | 1 - xen/arch/x86/include/asm/numa.h | 57 +--- xen/arch/x86/include/asm/setup.h | 1 - xen/arch/x86/numa.c | 430 +---------------------------- xen/arch/x86/smpboot.c | 2 +- xen/arch/x86/srat.c | 10 +- xen/common/Makefile | 1 + xen/common/numa.c | 460 +++++++++++++++++++++++++++++++ xen/include/xen/numa.h | 67 +++++ 9 files changed, 536 insertions(+), 493 deletions(-) create mode 100644 xen/common/numa.c