Message ID | adb17b3a-f57d-0a34-73e5-6f32ec60ab5e@suse.com (mailing list archive)
---|---
State | New, archived
Series | x86/NUMA: correct off-by-1 in node map size calculation
On Tue, Sep 27, 2022 at 04:14:21PM +0200, Jan Beulich wrote:
> extract_lsb_from_nodes() accumulates "memtop" from all PDXes one past
> the covered ranges. Hence the maximum address which can validly be used
> to index the node map is one below this value, and we may currently set
> up a node map with an unused (and never initialized) trailing entry. In
> boundary cases this may also mean we dynamically allocate a page when
> the static (64-entry) map would suffice.
>
> While there, also correct the comment ahead of the function, for it to
> match the actual code: Linux commit 54413927f022 ("x86-64:
> x86_64-make-the-numa-hash-function-nodemap-allocation fix fix") removed
> the ORing in of the end address before we actually cloned their code.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

> ---
> Really the shift value may end up needlessly small when there's
> discontiguous memory. Within a gap, any address can be taken for the
> node boundary, and hence neither the end of the lower range nor the
> start of the higher range necessarily is the best address to use. For
> example with these two node ranges (numbers are frame addresses)
>
> [10000,17fff]
> [28000,2ffff]
>
> we'd calculate the shift as 15 when 16 or even 17 (because the start of
> the 1st range can also be ignored) would do. I haven't tried to properly
> prove it yet, but it looks to me as if the top bit of the XOR of lower
> range (inclusive) end and higher range start would be what would want
> accumulating (of course requiring the entries to be sorted, or to be
> processed in address order). This would then "naturally" exclude lowest
> range start and highest range end.

I'm not familiar with the logic in the NUMA code, but this seems like a
possible optimization. It might be good to note in which way a bigger
shift is beneficial.

Thanks, Roger.
Hi Jan,

> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of
> Jan Beulich
> Sent: 27 September 2022 22:14
> To: xen-devel@lists.xenproject.org
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; Wei Liu <wl@xen.org>;
> Roger Pau Monné <roger.pau@citrix.com>
> Subject: [PATCH] x86/NUMA: correct off-by-1 in node map size calculation
>
> extract_lsb_from_nodes() accumulates "memtop" from all PDXes one past
> the covered ranges. Hence the maximum address which can validly be used
> to index the node map is one below this value, and we may currently set
> up a node map with an unused (and never initialized) trailing entry. In
> boundary cases this may also mean we dynamically allocate a page when
> the static (64-entry) map would suffice.
>
> While there, also correct the comment ahead of the function, for it to
> match the actual code: Linux commit 54413927f022 ("x86-64:
> x86_64-make-the-numa-hash-function-nodemap-allocation fix fix") removed
> the ORing in of the end address before we actually cloned their code.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Really the shift value may end up needlessly small when there's
> discontiguous memory. Within a gap, any address can be taken for the
> node boundary, and hence neither the end of the lower range nor the
> start of the higher range necessarily is the best address to use. For
> example with these two node ranges (numbers are frame addresses)
>
> [10000,17fff]
> [28000,2ffff]
>
> we'd calculate the shift as 15 when 16 or even 17 (because the start of
> the 1st range can also be ignored) would do. I haven't tried to properly
> prove it yet, but it looks to me as if the top bit of the XOR of lower
> range (inclusive) end and higher range start would be what would want
> accumulating (of course requiring the entries to be sorted, or to be
> processed in address order). This would then "naturally" exclude lowest
> range start and highest range end.
>
> --- a/xen/arch/x86/numa.c
> +++ b/xen/arch/x86/numa.c
> @@ -110,7 +110,7 @@ static int __init allocate_cachealigned_
>  }
>
>  /*
> - * The LSB of all start and end addresses in the node map is the value of the
> + * The LSB of all start addresses in the node map is the value of the
>   * maximum possible shift.
>   */
>  static int __init extract_lsb_from_nodes(const struct node *nodes,
> @@ -135,7 +135,7 @@ static int __init extract_lsb_from_nodes
>          i = BITS_PER_LONG - 1;
>      else
>          i = find_first_bit(&bitfield, sizeof(unsigned long)*8);
> -    memnodemapsize = (memtop >> i) + 1;
> +    memnodemapsize = ((memtop - 1) >> i) + 1;
>      return i;
>  }

Thanks for this fix.

Reviewed-by: Wei Chen <Wei.Chen@arm.com>
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -110,7 +110,7 @@ static int __init allocate_cachealigned_
 }
 
 /*
- * The LSB of all start and end addresses in the node map is the value of the
+ * The LSB of all start addresses in the node map is the value of the
  * maximum possible shift.
  */
 static int __init extract_lsb_from_nodes(const struct node *nodes,
@@ -135,7 +135,7 @@ static int __init extract_lsb_from_nodes
         i = BITS_PER_LONG - 1;
     else
         i = find_first_bit(&bitfield, sizeof(unsigned long)*8);
-    memnodemapsize = (memtop >> i) + 1;
+    memnodemapsize = ((memtop - 1) >> i) + 1;
     return i;
 }
extract_lsb_from_nodes() accumulates "memtop" from all PDXes one past the covered ranges. Hence the maximum address which can validly be used to index the node map is one below this value, and we may currently set up a node map with an unused (and never initialized) trailing entry. In boundary cases this may also mean we dynamically allocate a page when the static (64-entry) map would suffice.

While there, also correct the comment ahead of the function, for it to match the actual code: Linux commit 54413927f022 ("x86-64: x86_64-make-the-numa-hash-function-nodemap-allocation fix fix") removed the ORing in of the end address before we actually cloned their code.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Really the shift value may end up needlessly small when there's discontiguous memory. Within a gap, any address can be taken for the node boundary, and hence neither the end of the lower range nor the start of the higher range necessarily is the best address to use. For example, with these two node ranges (numbers are frame addresses)

[10000,17fff]
[28000,2ffff]

we'd calculate the shift as 15 when 16 or even 17 (because the start of the 1st range can also be ignored) would do. I haven't tried to properly prove it yet, but it looks to me as if the top bit of the XOR of the lower range's (inclusive) end and the higher range's start is what wants accumulating (of course requiring the entries to be sorted, or to be processed in address order). This would then "naturally" exclude the lowest range start and the highest range end.