| Message ID | 20181128140136.GG10377@bombadil.infradead.org (mailing list archive) |
|---|---|
| State | New, archived |
| Series | Number of arguments in vmalloc.c |
On 11/28/18 3:01 PM, Matthew Wilcox wrote: > > Some of the functions in vmalloc.c have as many as nine arguments. > So I thought I'd have a quick go at bundling the ones that make sense > into a struct and pass around a pointer to that struct. Well, it made > the generated code worse, Worse in which metric? > so I thought I'd share my attempt so nobody > else bothers (or soebody points out that I did something stupid). I guess in some of the functions the args parameter could be const? Might make some difference. Anyway this shouldn't be a fast path, so even if the generated code is e.g. somewhat larger, then it still might make sense to reduce the insane parameter lists. > I tried a few variations on this theme; bundling gfp_t and node into > the struct made it even worse, as did adding caller and vm_flags. This > is the least bad version. > > (Yes, the naming is bad; I'm not tidying this up for submission, I'm > showing an experiment that didn't work). > > Nacked-by: Matthew Wilcox <willy@infradead.org> >
On Mon, Dec 03, 2018 at 02:59:36PM +0100, Vlastimil Babka wrote: > On 11/28/18 3:01 PM, Matthew Wilcox wrote: > > > > Some of the functions in vmalloc.c have as many as nine arguments. > > So I thought I'd have a quick go at bundling the ones that make sense > > into a struct and pass around a pointer to that struct. Well, it made > > the generated code worse, > > Worse in which metric? More instructions to accomplish the same thing. > > so I thought I'd share my attempt so nobody > > else bothers (or soebody points out that I did something stupid). > > I guess in some of the functions the args parameter could be const? > Might make some difference. > > Anyway this shouldn't be a fast path, so even if the generated code is > e.g. somewhat larger, then it still might make sense to reduce the > insane parameter lists. It might ... I'm not sure it's even easier to program than the original though.
> On Dec 3, 2018, at 8:13 AM, Matthew Wilcox <willy@infradead.org> wrote: > > On Mon, Dec 03, 2018 at 02:59:36PM +0100, Vlastimil Babka wrote: >> On 11/28/18 3:01 PM, Matthew Wilcox wrote: >>> Some of the functions in vmalloc.c have as many as nine arguments. >>> So I thought I'd have a quick go at bundling the ones that make sense >>> into a struct and pass around a pointer to that struct. Well, it made >>> the generated code worse, >> >> Worse in which metric? > > More instructions to accomplish the same thing. > >>> so I thought I'd share my attempt so nobody >>> else bothers (or soebody points out that I did something stupid). >> >> I guess in some of the functions the args parameter could be const? >> Might make some difference. >> >> Anyway this shouldn't be a fast path, so even if the generated code is >> e.g. somewhat larger, then it still might make sense to reduce the >> insane parameter lists. > > It might ... I'm not sure it's even easier to program than the original > though. My intuition is that if all the fields of vm_args were initialized together (in the same function), and a 'const struct vm_args *' was provided as an argument to other functions, code would be better (at least better than what you got right now). I’m not saying it is easily applicable in this use-case (since I didn’t check).
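A minimal kernel-style sketch of the shape Nadav is proposing here (the struct mirrors the vm_args in Matthew's patch below; the caller and its name are illustrative, not from the thread): initialize all fields together in one place, then hand callees a pointer-to-const so they cannot modify the block and the compiler knows the values are stable.

```c
struct vm_args {
	unsigned long size;
	unsigned long align;
	unsigned long start;
	unsigned long end;
};

/* the patched signature, with const added per Nadav's suggestion;
 * in vmalloc.c the real function is static */
struct vmap_area *alloc_vmap_area(const struct vm_args *args,
				  int node, gfp_t gfp_mask);

struct vmap_area *example_caller(unsigned long size, int node)
{
	/* all fields initialized together, in the same function */
	const struct vm_args args = {
		.size	= size,
		.align	= PAGE_SIZE,
		.start	= VMALLOC_START,
		.end	= VMALLOC_END,
	};

	return alloc_vmap_area(&args, node, GFP_KERNEL);
}
```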
On Mon, Dec 03, 2018 at 02:04:41PM -0800, Nadav Amit wrote: > On Dec 3, 2018, at 8:13 AM, Matthew Wilcox <willy@infradead.org> wrote: > > On Mon, Dec 03, 2018 at 02:59:36PM +0100, Vlastimil Babka wrote: > >> On 11/28/18 3:01 PM, Matthew Wilcox wrote: > >>> Some of the functions in vmalloc.c have as many as nine arguments. > >>> So I thought I'd have a quick go at bundling the ones that make sense > >>> into a struct and pass around a pointer to that struct. Well, it made > >>> the generated code worse, > >> > >> Worse in which metric? > > > > More instructions to accomplish the same thing. > > > >>> so I thought I'd share my attempt so nobody > >>> else bothers (or soebody points out that I did something stupid). > >> > >> I guess in some of the functions the args parameter could be const? > >> Might make some difference. > >> > >> Anyway this shouldn't be a fast path, so even if the generated code is > >> e.g. somewhat larger, then it still might make sense to reduce the > >> insane parameter lists. > > > > It might ... I'm not sure it's even easier to program than the original > > though. > > My intuition is that if all the fields of vm_args were initialized together > (in the same function), and a 'const struct vm_args *' was provided as > an argument to other functions, code would be better (at least better than > what you got right now). > > I’m not saying it is easily applicable in this use-case (since I didn’t > check). Your intuition is wrong ... text data bss dec hex filename 9466 81 32 9579 256b before.o 9546 81 32 9659 25bb .build-tiny/mm/vmalloc.o 9546 81 32 9659 25bb const.o indeed, there's no difference between with or without the const, according to 'cmp'. Now, only alloc_vmap_area() gets to take a const argument. __get_vm_area_node() intentionally modifies the arguments. But feel free to play around with this; you might be able to make it do something worthwhile.
> On Dec 3, 2018, at 2:49 PM, Matthew Wilcox <willy@infradead.org> wrote: > > On Mon, Dec 03, 2018 at 02:04:41PM -0800, Nadav Amit wrote: >> On Dec 3, 2018, at 8:13 AM, Matthew Wilcox <willy@infradead.org> wrote: >>> On Mon, Dec 03, 2018 at 02:59:36PM +0100, Vlastimil Babka wrote: >>>> On 11/28/18 3:01 PM, Matthew Wilcox wrote: >>>>> Some of the functions in vmalloc.c have as many as nine arguments. >>>>> So I thought I'd have a quick go at bundling the ones that make sense >>>>> into a struct and pass around a pointer to that struct. Well, it made >>>>> the generated code worse, >>>> >>>> Worse in which metric? >>> >>> More instructions to accomplish the same thing. >>> >>>>> so I thought I'd share my attempt so nobody >>>>> else bothers (or soebody points out that I did something stupid). >>>> >>>> I guess in some of the functions the args parameter could be const? >>>> Might make some difference. >>>> >>>> Anyway this shouldn't be a fast path, so even if the generated code is >>>> e.g. somewhat larger, then it still might make sense to reduce the >>>> insane parameter lists. >>> >>> It might ... I'm not sure it's even easier to program than the original >>> though. >> >> My intuition is that if all the fields of vm_args were initialized together >> (in the same function), and a 'const struct vm_args *' was provided as >> an argument to other functions, code would be better (at least better than >> what you got right now). >> >> I’m not saying it is easily applicable in this use-case (since I didn’t >> check). > > Your intuition is wrong ... > > text data bss dec hex filename > 9466 81 32 9579 256b before.o > 9546 81 32 9659 25bb .build-tiny/mm/vmalloc.o > 9546 81 32 9659 25bb const.o > > indeed, there's no difference between with or without the const, according > to 'cmp'. > > Now, only alloc_vmap_area() gets to take a const argument. > __get_vm_area_node() intentionally modifies the arguments. But feel > free to play around with this; you might be able to make it do something > worthwhile. I was playing with it (a bit). What I suggested (modifying __get_vm_area_node() so it will not change arguments) helps a bit, but not much. One insight that I got is that at least part of the overhead comes from the the stack protector code that gcc emits.
> On Dec 3, 2018, at 7:12 PM, Nadav Amit <nadav.amit@gmail.com> wrote: > >> On Dec 3, 2018, at 2:49 PM, Matthew Wilcox <willy@infradead.org> wrote: >> >> On Mon, Dec 03, 2018 at 02:04:41PM -0800, Nadav Amit wrote: >>> On Dec 3, 2018, at 8:13 AM, Matthew Wilcox <willy@infradead.org> wrote: >>>> On Mon, Dec 03, 2018 at 02:59:36PM +0100, Vlastimil Babka wrote: >>>>> On 11/28/18 3:01 PM, Matthew Wilcox wrote: >>>>>> Some of the functions in vmalloc.c have as many as nine arguments. >>>>>> So I thought I'd have a quick go at bundling the ones that make sense >>>>>> into a struct and pass around a pointer to that struct. Well, it made >>>>>> the generated code worse, >>>>> >>>>> Worse in which metric? >>>> >>>> More instructions to accomplish the same thing. >>>> >>>>>> so I thought I'd share my attempt so nobody >>>>>> else bothers (or soebody points out that I did something stupid). >>>>> >>>>> I guess in some of the functions the args parameter could be const? >>>>> Might make some difference. >>>>> >>>>> Anyway this shouldn't be a fast path, so even if the generated code is >>>>> e.g. somewhat larger, then it still might make sense to reduce the >>>>> insane parameter lists. >>>> >>>> It might ... I'm not sure it's even easier to program than the original >>>> though. >>> >>> My intuition is that if all the fields of vm_args were initialized together >>> (in the same function), and a 'const struct vm_args *' was provided as >>> an argument to other functions, code would be better (at least better than >>> what you got right now). >>> >>> I’m not saying it is easily applicable in this use-case (since I didn’t >>> check). >> >> Your intuition is wrong ... >> >> text data bss dec hex filename >> 9466 81 32 9579 256b before.o >> 9546 81 32 9659 25bb .build-tiny/mm/vmalloc.o >> 9546 81 32 9659 25bb const.o >> >> indeed, there's no difference between with or without the const, according >> to 'cmp'. >> >> Now, only alloc_vmap_area() gets to take a const argument. >> __get_vm_area_node() intentionally modifies the arguments. But feel >> free to play around with this; you might be able to make it do something >> worthwhile. > > I was playing with it (a bit). What I suggested (modifying > __get_vm_area_node() so it will not change arguments) helps a bit, but not > much. > > One insight that I got is that at least part of the overhead comes from the > the stack protector code that gcc emits. [ +Peter ] So I dug some more (I’m still not done), and found various trivial things (e.g., storing zero extending u32 immediate is shorter for registers, inlining already takes place). *But* there is one thing that may require some attention - patch b59167ac7bafd ("x86/percpu: Fix this_cpu_read()”) set ordering constraints on the VM_ARGS() evaluation. And this patch also imposes, it appears, (unnecessary) constraints on other pieces of code. These constraints are due to the addition of the volatile keyword for this_cpu_read() by the patch. This affects at least 68 functions in my kernel build, some of which are hot (I think), e.g., finish_task_switch(), smp_x86_platform_ipi() and select_idle_sibling(). Peter, perhaps the solution was too big of a hammer? Is it possible instead to create a separate "this_cpu_read_once()” with the volatile keyword? Such a function can be used for native_sched_clock() and other seqlocks, etc.
On Thu, Dec 06, 2018 at 12:28:26AM -0800, Nadav Amit wrote: > [ +Peter ] > > So I dug some more (I’m still not done), and found various trivial things > (e.g., storing zero extending u32 immediate is shorter for registers, > inlining already takes place). > > *But* there is one thing that may require some attention - patch > b59167ac7bafd ("x86/percpu: Fix this_cpu_read()”) set ordering constraints > on the VM_ARGS() evaluation. And this patch also imposes, it appears, > (unnecessary) constraints on other pieces of code. > > These constraints are due to the addition of the volatile keyword for > this_cpu_read() by the patch. This affects at least 68 functions in my > kernel build, some of which are hot (I think), e.g., finish_task_switch(), > smp_x86_platform_ipi() and select_idle_sibling(). > > Peter, perhaps the solution was too big of a hammer? Is it possible instead > to create a separate "this_cpu_read_once()” with the volatile keyword? Such > a function can be used for native_sched_clock() and other seqlocks, etc. No. like the commit writes this_cpu_read() _must_ imply READ_ONCE(). If you want something else, use something else, there's plenty other options available. There's this_cpu_op_stable(), but also __this_cpu_read() and raw_this_cpu_read() (which currently don't differ from this_cpu_read() but could).
On Thu, Dec 06, 2018 at 11:25:59AM +0100, Peter Zijlstra wrote: > On Thu, Dec 06, 2018 at 12:28:26AM -0800, Nadav Amit wrote: > > [ +Peter ] > > > > So I dug some more (I’m still not done), and found various trivial things > > (e.g., storing zero extending u32 immediate is shorter for registers, > > inlining already takes place). > > > > *But* there is one thing that may require some attention - patch > > b59167ac7bafd ("x86/percpu: Fix this_cpu_read()”) set ordering constraints > > on the VM_ARGS() evaluation. And this patch also imposes, it appears, > > (unnecessary) constraints on other pieces of code. > > > > These constraints are due to the addition of the volatile keyword for > > this_cpu_read() by the patch. This affects at least 68 functions in my > > kernel build, some of which are hot (I think), e.g., finish_task_switch(), > > smp_x86_platform_ipi() and select_idle_sibling(). > > > > Peter, perhaps the solution was too big of a hammer? Is it possible instead > > to create a separate "this_cpu_read_once()” with the volatile keyword? Such > > a function can be used for native_sched_clock() and other seqlocks, etc. > > No. like the commit writes this_cpu_read() _must_ imply READ_ONCE(). If > you want something else, use something else, there's plenty other > options available. The thing is, the this_cpu_*() things are defined IRQ-safe, this means the values are subject to change from IRQs, and thus must be reloaded. Also (the generic form): local_irq_save() __this_cpu_read() local_irq_restore() would not be allowed to be munged like that. Which raises the point that percpu_from_op() and the other also need that volatile. __this_cpu_*() OTOH assumes external preempt/IRQ disabling and could thus be allowed to move crud about. Which would suggest the following --- arch/x86/include/asm/percpu.h | 224 +++++++++++++++++++++--------------------- 1 file changed, 112 insertions(+), 112 deletions(-) diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h index 1a19d11cfbbd..f75ccccd71aa 100644 --- a/arch/x86/include/asm/percpu.h +++ b/arch/x86/include/asm/percpu.h @@ -87,7 +87,7 @@ * don't give an lvalue though). */ extern void __bad_percpu_size(void); -#define percpu_to_op(op, var, val) \ +#define percpu_to_op(qual, op, var, val) \ do { \ typedef typeof(var) pto_T__; \ if (0) { \ @@ -97,22 +97,22 @@ do { \ } \ switch (sizeof(var)) { \ case 1: \ - asm(op "b %1,"__percpu_arg(0) \ + asm qual (op "b %1,"__percpu_arg(0) \ : "+m" (var) \ : "qi" ((pto_T__)(val))); \ break; \ case 2: \ - asm(op "w %1,"__percpu_arg(0) \ + asm qual (op "w %1,"__percpu_arg(0) \ : "+m" (var) \ : "ri" ((pto_T__)(val))); \ break; \ case 4: \ - asm(op "l %1,"__percpu_arg(0) \ + asm qual (op "l %1,"__percpu_arg(0) \ : "+m" (var) \ : "ri" ((pto_T__)(val))); \ break; \ case 8: \ - asm(op "q %1,"__percpu_arg(0) \ + asm qual (op "q %1,"__percpu_arg(0) \ : "+m" (var) \ : "re" ((pto_T__)(val))); \ break; \ @@ -124,7 +124,7 @@ do { \ * Generate a percpu add to memory instruction and optimize code * if one is added or subtracted. 
*/ -#define percpu_add_op(var, val) \ +#define percpu_add_op(qual, var, val) \ do { \ typedef typeof(var) pao_T__; \ const int pao_ID__ = (__builtin_constant_p(val) && \ @@ -138,41 +138,41 @@ do { \ switch (sizeof(var)) { \ case 1: \ if (pao_ID__ == 1) \ - asm("incb "__percpu_arg(0) : "+m" (var)); \ + asm qual ("incb "__percpu_arg(0) : "+m" (var)); \ else if (pao_ID__ == -1) \ - asm("decb "__percpu_arg(0) : "+m" (var)); \ + asm qual ("decb "__percpu_arg(0) : "+m" (var)); \ else \ - asm("addb %1, "__percpu_arg(0) \ + asm qual ("addb %1, "__percpu_arg(0) \ : "+m" (var) \ : "qi" ((pao_T__)(val))); \ break; \ case 2: \ if (pao_ID__ == 1) \ - asm("incw "__percpu_arg(0) : "+m" (var)); \ + asm qual ("incw "__percpu_arg(0) : "+m" (var)); \ else if (pao_ID__ == -1) \ - asm("decw "__percpu_arg(0) : "+m" (var)); \ + asm qual ("decw "__percpu_arg(0) : "+m" (var)); \ else \ - asm("addw %1, "__percpu_arg(0) \ + asm qual ("addw %1, "__percpu_arg(0) \ : "+m" (var) \ : "ri" ((pao_T__)(val))); \ break; \ case 4: \ if (pao_ID__ == 1) \ - asm("incl "__percpu_arg(0) : "+m" (var)); \ + asm qual ("incl "__percpu_arg(0) : "+m" (var)); \ else if (pao_ID__ == -1) \ - asm("decl "__percpu_arg(0) : "+m" (var)); \ + asm qual ("decl "__percpu_arg(0) : "+m" (var)); \ else \ - asm("addl %1, "__percpu_arg(0) \ + asm qual ("addl %1, "__percpu_arg(0) \ : "+m" (var) \ : "ri" ((pao_T__)(val))); \ break; \ case 8: \ if (pao_ID__ == 1) \ - asm("incq "__percpu_arg(0) : "+m" (var)); \ + asm qual ("incq "__percpu_arg(0) : "+m" (var)); \ else if (pao_ID__ == -1) \ - asm("decq "__percpu_arg(0) : "+m" (var)); \ + asm qual ("decq "__percpu_arg(0) : "+m" (var)); \ else \ - asm("addq %1, "__percpu_arg(0) \ + asm qual ("addq %1, "__percpu_arg(0) \ : "+m" (var) \ : "re" ((pao_T__)(val))); \ break; \ @@ -180,27 +180,27 @@ do { \ } \ } while (0) -#define percpu_from_op(op, var) \ +#define percpu_from_op(qual, op, var) \ ({ \ typeof(var) pfo_ret__; \ switch (sizeof(var)) { \ case 1: \ - asm volatile(op "b "__percpu_arg(1)",%0"\ + asm qual (op "b "__percpu_arg(1)",%0" \ : "=q" (pfo_ret__) \ : "m" (var)); \ break; \ case 2: \ - asm volatile(op "w "__percpu_arg(1)",%0"\ + asm qual (op "w "__percpu_arg(1)",%0" \ : "=r" (pfo_ret__) \ : "m" (var)); \ break; \ case 4: \ - asm volatile(op "l "__percpu_arg(1)",%0"\ + asm qual (op "l "__percpu_arg(1)",%0" \ : "=r" (pfo_ret__) \ : "m" (var)); \ break; \ case 8: \ - asm volatile(op "q "__percpu_arg(1)",%0"\ + asm qual (op "q "__percpu_arg(1)",%0" \ : "=r" (pfo_ret__) \ : "m" (var)); \ break; \ @@ -238,23 +238,23 @@ do { \ pfo_ret__; \ }) -#define percpu_unary_op(op, var) \ +#define percpu_unary_op(qual, op, var) \ ({ \ switch (sizeof(var)) { \ case 1: \ - asm(op "b "__percpu_arg(0) \ + asm qual (op "b "__percpu_arg(0) \ : "+m" (var)); \ break; \ case 2: \ - asm(op "w "__percpu_arg(0) \ + asm qual (op "w "__percpu_arg(0) \ : "+m" (var)); \ break; \ case 4: \ - asm(op "l "__percpu_arg(0) \ + asm qual (op "l "__percpu_arg(0) \ : "+m" (var)); \ break; \ case 8: \ - asm(op "q "__percpu_arg(0) \ + asm qual (op "q "__percpu_arg(0) \ : "+m" (var)); \ break; \ default: __bad_percpu_size(); \ @@ -264,27 +264,27 @@ do { \ /* * Add return operation */ -#define percpu_add_return_op(var, val) \ +#define percpu_add_return_op(qual, var, val) \ ({ \ typeof(var) paro_ret__ = val; \ switch (sizeof(var)) { \ case 1: \ - asm("xaddb %0, "__percpu_arg(1) \ + asm qual ("xaddb %0, "__percpu_arg(1) \ : "+q" (paro_ret__), "+m" (var) \ : : "memory"); \ break; \ case 2: \ - asm("xaddw %0, "__percpu_arg(1) \ + asm qual ("xaddw %0, 
"__percpu_arg(1) \ : "+r" (paro_ret__), "+m" (var) \ : : "memory"); \ break; \ case 4: \ - asm("xaddl %0, "__percpu_arg(1) \ + asm qual ("xaddl %0, "__percpu_arg(1) \ : "+r" (paro_ret__), "+m" (var) \ : : "memory"); \ break; \ case 8: \ - asm("xaddq %0, "__percpu_arg(1) \ + asm qual ("xaddq %0, "__percpu_arg(1) \ : "+re" (paro_ret__), "+m" (var) \ : : "memory"); \ break; \ @@ -299,13 +299,13 @@ do { \ * expensive due to the implied lock prefix. The processor cannot prefetch * cachelines if xchg is used. */ -#define percpu_xchg_op(var, nval) \ +#define percpu_xchg_op(qual, var, nval) \ ({ \ typeof(var) pxo_ret__; \ typeof(var) pxo_new__ = (nval); \ switch (sizeof(var)) { \ case 1: \ - asm("\n\tmov "__percpu_arg(1)",%%al" \ + asm qual ("\n\tmov "__percpu_arg(1)",%%al" \ "\n1:\tcmpxchgb %2, "__percpu_arg(1) \ "\n\tjnz 1b" \ : "=&a" (pxo_ret__), "+m" (var) \ @@ -313,7 +313,7 @@ do { \ : "memory"); \ break; \ case 2: \ - asm("\n\tmov "__percpu_arg(1)",%%ax" \ + asm qual ("\n\tmov "__percpu_arg(1)",%%ax" \ "\n1:\tcmpxchgw %2, "__percpu_arg(1) \ "\n\tjnz 1b" \ : "=&a" (pxo_ret__), "+m" (var) \ @@ -321,7 +321,7 @@ do { \ : "memory"); \ break; \ case 4: \ - asm("\n\tmov "__percpu_arg(1)",%%eax" \ + asm qual ("\n\tmov "__percpu_arg(1)",%%eax" \ "\n1:\tcmpxchgl %2, "__percpu_arg(1) \ "\n\tjnz 1b" \ : "=&a" (pxo_ret__), "+m" (var) \ @@ -329,7 +329,7 @@ do { \ : "memory"); \ break; \ case 8: \ - asm("\n\tmov "__percpu_arg(1)",%%rax" \ + asm qual ("\n\tmov "__percpu_arg(1)",%%rax" \ "\n1:\tcmpxchgq %2, "__percpu_arg(1) \ "\n\tjnz 1b" \ : "=&a" (pxo_ret__), "+m" (var) \ @@ -345,32 +345,32 @@ do { \ * cmpxchg has no such implied lock semantics as a result it is much * more efficient for cpu local operations. */ -#define percpu_cmpxchg_op(var, oval, nval) \ +#define percpu_cmpxchg_op(qual, var, oval, nval) \ ({ \ typeof(var) pco_ret__; \ typeof(var) pco_old__ = (oval); \ typeof(var) pco_new__ = (nval); \ switch (sizeof(var)) { \ case 1: \ - asm("cmpxchgb %2, "__percpu_arg(1) \ + asm qual ("cmpxchgb %2, "__percpu_arg(1) \ : "=a" (pco_ret__), "+m" (var) \ : "q" (pco_new__), "0" (pco_old__) \ : "memory"); \ break; \ case 2: \ - asm("cmpxchgw %2, "__percpu_arg(1) \ + asm qual ("cmpxchgw %2, "__percpu_arg(1) \ : "=a" (pco_ret__), "+m" (var) \ : "r" (pco_new__), "0" (pco_old__) \ : "memory"); \ break; \ case 4: \ - asm("cmpxchgl %2, "__percpu_arg(1) \ + asm qual ("cmpxchgl %2, "__percpu_arg(1) \ : "=a" (pco_ret__), "+m" (var) \ : "r" (pco_new__), "0" (pco_old__) \ : "memory"); \ break; \ case 8: \ - asm("cmpxchgq %2, "__percpu_arg(1) \ + asm qual ("cmpxchgq %2, "__percpu_arg(1) \ : "=a" (pco_ret__), "+m" (var) \ : "r" (pco_new__), "0" (pco_old__) \ : "memory"); \ @@ -391,58 +391,58 @@ do { \ */ #define this_cpu_read_stable(var) percpu_stable_op("mov", var) -#define raw_cpu_read_1(pcp) percpu_from_op("mov", pcp) -#define raw_cpu_read_2(pcp) percpu_from_op("mov", pcp) -#define raw_cpu_read_4(pcp) percpu_from_op("mov", pcp) - -#define raw_cpu_write_1(pcp, val) percpu_to_op("mov", (pcp), val) -#define raw_cpu_write_2(pcp, val) percpu_to_op("mov", (pcp), val) -#define raw_cpu_write_4(pcp, val) percpu_to_op("mov", (pcp), val) -#define raw_cpu_add_1(pcp, val) percpu_add_op((pcp), val) -#define raw_cpu_add_2(pcp, val) percpu_add_op((pcp), val) -#define raw_cpu_add_4(pcp, val) percpu_add_op((pcp), val) -#define raw_cpu_and_1(pcp, val) percpu_to_op("and", (pcp), val) -#define raw_cpu_and_2(pcp, val) percpu_to_op("and", (pcp), val) -#define raw_cpu_and_4(pcp, val) percpu_to_op("and", (pcp), val) -#define raw_cpu_or_1(pcp, 
val) percpu_to_op("or", (pcp), val) -#define raw_cpu_or_2(pcp, val) percpu_to_op("or", (pcp), val) -#define raw_cpu_or_4(pcp, val) percpu_to_op("or", (pcp), val) -#define raw_cpu_xchg_1(pcp, val) percpu_xchg_op(pcp, val) -#define raw_cpu_xchg_2(pcp, val) percpu_xchg_op(pcp, val) -#define raw_cpu_xchg_4(pcp, val) percpu_xchg_op(pcp, val) - -#define this_cpu_read_1(pcp) percpu_from_op("mov", pcp) -#define this_cpu_read_2(pcp) percpu_from_op("mov", pcp) -#define this_cpu_read_4(pcp) percpu_from_op("mov", pcp) -#define this_cpu_write_1(pcp, val) percpu_to_op("mov", (pcp), val) -#define this_cpu_write_2(pcp, val) percpu_to_op("mov", (pcp), val) -#define this_cpu_write_4(pcp, val) percpu_to_op("mov", (pcp), val) -#define this_cpu_add_1(pcp, val) percpu_add_op((pcp), val) -#define this_cpu_add_2(pcp, val) percpu_add_op((pcp), val) -#define this_cpu_add_4(pcp, val) percpu_add_op((pcp), val) -#define this_cpu_and_1(pcp, val) percpu_to_op("and", (pcp), val) -#define this_cpu_and_2(pcp, val) percpu_to_op("and", (pcp), val) -#define this_cpu_and_4(pcp, val) percpu_to_op("and", (pcp), val) -#define this_cpu_or_1(pcp, val) percpu_to_op("or", (pcp), val) -#define this_cpu_or_2(pcp, val) percpu_to_op("or", (pcp), val) -#define this_cpu_or_4(pcp, val) percpu_to_op("or", (pcp), val) -#define this_cpu_xchg_1(pcp, nval) percpu_xchg_op(pcp, nval) -#define this_cpu_xchg_2(pcp, nval) percpu_xchg_op(pcp, nval) -#define this_cpu_xchg_4(pcp, nval) percpu_xchg_op(pcp, nval) - -#define raw_cpu_add_return_1(pcp, val) percpu_add_return_op(pcp, val) -#define raw_cpu_add_return_2(pcp, val) percpu_add_return_op(pcp, val) -#define raw_cpu_add_return_4(pcp, val) percpu_add_return_op(pcp, val) -#define raw_cpu_cmpxchg_1(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) -#define raw_cpu_cmpxchg_2(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) -#define raw_cpu_cmpxchg_4(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) - -#define this_cpu_add_return_1(pcp, val) percpu_add_return_op(pcp, val) -#define this_cpu_add_return_2(pcp, val) percpu_add_return_op(pcp, val) -#define this_cpu_add_return_4(pcp, val) percpu_add_return_op(pcp, val) -#define this_cpu_cmpxchg_1(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) -#define this_cpu_cmpxchg_2(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) -#define this_cpu_cmpxchg_4(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) +#define raw_cpu_read_1(pcp) percpu_from_op(, "mov", pcp) +#define raw_cpu_read_2(pcp) percpu_from_op(, "mov", pcp) +#define raw_cpu_read_4(pcp) percpu_from_op(, "mov", pcp) + +#define raw_cpu_write_1(pcp, val) percpu_to_op(, "mov", (pcp), val) +#define raw_cpu_write_2(pcp, val) percpu_to_op(, "mov", (pcp), val) +#define raw_cpu_write_4(pcp, val) percpu_to_op(, "mov", (pcp), val) +#define raw_cpu_add_1(pcp, val) percpu_add_op(, (pcp), val) +#define raw_cpu_add_2(pcp, val) percpu_add_op(, (pcp), val) +#define raw_cpu_add_4(pcp, val) percpu_add_op(, (pcp), val) +#define raw_cpu_and_1(pcp, val) percpu_to_op(, "and", (pcp), val) +#define raw_cpu_and_2(pcp, val) percpu_to_op(, "and", (pcp), val) +#define raw_cpu_and_4(pcp, val) percpu_to_op(, "and", (pcp), val) +#define raw_cpu_or_1(pcp, val) percpu_to_op(, "or", (pcp), val) +#define raw_cpu_or_2(pcp, val) percpu_to_op(, "or", (pcp), val) +#define raw_cpu_or_4(pcp, val) percpu_to_op(, "or", (pcp), val) +#define raw_cpu_xchg_1(pcp, val) percpu_xchg_op(, pcp, val) +#define raw_cpu_xchg_2(pcp, val) percpu_xchg_op(, pcp, val) +#define raw_cpu_xchg_4(pcp, val) percpu_xchg_op(, pcp, val) + +#define 
this_cpu_read_1(pcp) percpu_from_op(volatile, "mov", pcp) +#define this_cpu_read_2(pcp) percpu_from_op(volatile, "mov", pcp) +#define this_cpu_read_4(pcp) percpu_from_op(volatile, "mov", pcp) +#define this_cpu_write_1(pcp, val) percpu_to_op(volatile, "mov", (pcp), val) +#define this_cpu_write_2(pcp, val) percpu_to_op(volatile, "mov", (pcp), val) +#define this_cpu_write_4(pcp, val) percpu_to_op(volatile, "mov", (pcp), val) +#define this_cpu_add_1(pcp, val) percpu_add_op(volatile, (pcp), val) +#define this_cpu_add_2(pcp, val) percpu_add_op(volatile, (pcp), val) +#define this_cpu_add_4(pcp, val) percpu_add_op(volatile, (pcp), val) +#define this_cpu_and_1(pcp, val) percpu_to_op(volatile, "and", (pcp), val) +#define this_cpu_and_2(pcp, val) percpu_to_op(volatile, "and", (pcp), val) +#define this_cpu_and_4(pcp, val) percpu_to_op(volatile, "and", (pcp), val) +#define this_cpu_or_1(pcp, val) percpu_to_op(volatile, "or", (pcp), val) +#define this_cpu_or_2(pcp, val) percpu_to_op(volatile, "or", (pcp), val) +#define this_cpu_or_4(pcp, val) percpu_to_op(volatile, "or", (pcp), val) +#define this_cpu_xchg_1(pcp, nval) percpu_xchg_op(volatile, pcp, nval) +#define this_cpu_xchg_2(pcp, nval) percpu_xchg_op(volatile, pcp, nval) +#define this_cpu_xchg_4(pcp, nval) percpu_xchg_op(volatile, pcp, nval) + +#define raw_cpu_add_return_1(pcp, val) percpu_add_return_op(, pcp, val) +#define raw_cpu_add_return_2(pcp, val) percpu_add_return_op(, pcp, val) +#define raw_cpu_add_return_4(pcp, val) percpu_add_return_op(, pcp, val) +#define raw_cpu_cmpxchg_1(pcp, oval, nval) percpu_cmpxchg_op(, pcp, oval, nval) +#define raw_cpu_cmpxchg_2(pcp, oval, nval) percpu_cmpxchg_op(, pcp, oval, nval) +#define raw_cpu_cmpxchg_4(pcp, oval, nval) percpu_cmpxchg_op(, pcp, oval, nval) + +#define this_cpu_add_return_1(pcp, val) percpu_add_return_op(volatile, pcp, val) +#define this_cpu_add_return_2(pcp, val) percpu_add_return_op(volatile, pcp, val) +#define this_cpu_add_return_4(pcp, val) percpu_add_return_op(volatile, pcp, val) +#define this_cpu_cmpxchg_1(pcp, oval, nval) percpu_cmpxchg_op(volatile, pcp, oval, nval) +#define this_cpu_cmpxchg_2(pcp, oval, nval) percpu_cmpxchg_op(volatile, pcp, oval, nval) +#define this_cpu_cmpxchg_4(pcp, oval, nval) percpu_cmpxchg_op(volatile, pcp, oval, nval) #ifdef CONFIG_X86_CMPXCHG64 #define percpu_cmpxchg8b_double(pcp1, pcp2, o1, o2, n1, n2) \ @@ -466,23 +466,23 @@ do { \ * 32 bit must fall back to generic operations. 
*/ #ifdef CONFIG_X86_64 -#define raw_cpu_read_8(pcp) percpu_from_op("mov", pcp) -#define raw_cpu_write_8(pcp, val) percpu_to_op("mov", (pcp), val) -#define raw_cpu_add_8(pcp, val) percpu_add_op((pcp), val) -#define raw_cpu_and_8(pcp, val) percpu_to_op("and", (pcp), val) -#define raw_cpu_or_8(pcp, val) percpu_to_op("or", (pcp), val) -#define raw_cpu_add_return_8(pcp, val) percpu_add_return_op(pcp, val) -#define raw_cpu_xchg_8(pcp, nval) percpu_xchg_op(pcp, nval) -#define raw_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) - -#define this_cpu_read_8(pcp) percpu_from_op("mov", pcp) -#define this_cpu_write_8(pcp, val) percpu_to_op("mov", (pcp), val) -#define this_cpu_add_8(pcp, val) percpu_add_op((pcp), val) -#define this_cpu_and_8(pcp, val) percpu_to_op("and", (pcp), val) -#define this_cpu_or_8(pcp, val) percpu_to_op("or", (pcp), val) -#define this_cpu_add_return_8(pcp, val) percpu_add_return_op(pcp, val) -#define this_cpu_xchg_8(pcp, nval) percpu_xchg_op(pcp, nval) -#define this_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) +#define raw_cpu_read_8(pcp) percpu_from_op(, "mov", pcp) +#define raw_cpu_write_8(pcp, val) percpu_to_op(, "mov", (pcp), val) +#define raw_cpu_add_8(pcp, val) percpu_add_op(, (pcp), val) +#define raw_cpu_and_8(pcp, val) percpu_to_op(, "and", (pcp), val) +#define raw_cpu_or_8(pcp, val) percpu_to_op(, "or", (pcp), val) +#define raw_cpu_add_return_8(pcp, val) percpu_add_return_op(, pcp, val) +#define raw_cpu_xchg_8(pcp, nval) percpu_xchg_op(, pcp, nval) +#define raw_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(, pcp, oval, nval) + +#define this_cpu_read_8(pcp) percpu_from_op(volatile, "mov", pcp) +#define this_cpu_write_8(pcp, val) percpu_to_op(volatile, "mov", (pcp), val) +#define this_cpu_add_8(pcp, val) percpu_add_op(volatile, (pcp), val) +#define this_cpu_and_8(pcp, val) percpu_to_op(volatile, "and", (pcp), val) +#define this_cpu_or_8(pcp, val) percpu_to_op(volatile, "or", (pcp), val) +#define this_cpu_add_return_8(pcp, val) percpu_add_return_op(volatile, pcp, val) +#define this_cpu_xchg_8(pcp, nval) percpu_xchg_op(volatile, pcp, nval) +#define this_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(volatile, pcp, oval, nval) /* * Pretty complex macro to generate cmpxchg16 instruction. The instruction
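To see the mechanism of the patch above in isolation, here is a reduced stand-alone version of the `qual` trick (toy names, not the kernel macros): the first macro parameter is pasted in front of the asm statement, so the this_cpu_*() wrappers can pass `volatile` while the raw_cpu_*() wrappers pass nothing and get a plain asm.

```c
#define my_percpu_read(qual, var) ({					\
	typeof(var) ret__;						\
	asm qual ("mov %1, %0" : "=r" (ret__) : "m" (var));		\
	ret__;								\
})

unsigned int counter;

/* this_cpu_read()-like: volatile, never merged with other reads */
unsigned int read_strong(void)  { return my_percpu_read(volatile, counter); }

/* raw_cpu_read()-like: the empty argument leaves a plain asm */
unsigned int read_relaxed(void) { return my_percpu_read(, counter); }
```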
> On Dec 6, 2018, at 2:25 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Thu, Dec 06, 2018 at 12:28:26AM -0800, Nadav Amit wrote: >> [ +Peter ] >> >> So I dug some more (I’m still not done), and found various trivial things >> (e.g., storing zero extending u32 immediate is shorter for registers, >> inlining already takes place). >> >> *But* there is one thing that may require some attention - patch >> b59167ac7bafd ("x86/percpu: Fix this_cpu_read()”) set ordering constraints >> on the VM_ARGS() evaluation. And this patch also imposes, it appears, >> (unnecessary) constraints on other pieces of code. >> >> These constraints are due to the addition of the volatile keyword for >> this_cpu_read() by the patch. This affects at least 68 functions in my >> kernel build, some of which are hot (I think), e.g., finish_task_switch(), >> smp_x86_platform_ipi() and select_idle_sibling(). >> >> Peter, perhaps the solution was too big of a hammer? Is it possible instead >> to create a separate "this_cpu_read_once()” with the volatile keyword? Such >> a function can be used for native_sched_clock() and other seqlocks, etc. > > No. like the commit writes this_cpu_read() _must_ imply READ_ONCE(). If > you want something else, use something else, there's plenty other > options available. > > There's this_cpu_op_stable(), but also __this_cpu_read() and > raw_this_cpu_read() (which currently don't differ from this_cpu_read() > but could). Would setting the inline assembly memory operand both as input and output be better than using the “volatile”? I think that if you do that, the compiler would treat the this_cpu_read() as something that changes the per-cpu variable, which would make it invalid to re-read the value. At the same time, it would not prevent reordering the read with other stuff.
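A sketch of the operand form Nadav is asking about (the function and its name are mine, purely illustrative; whether this actually gives the intended semantics is exactly the open question in the thread): listing the slot as a read-write "+m" operand tells the compiler the statement may change the memory, so a previously loaded value cannot be reused afterwards, without making the asm itself volatile.

```c
static inline unsigned int read_clobbering_slot(unsigned int *slot)
{
	unsigned int ret;

	/* "+m" marks *slot as both input and output of the asm */
	asm("mov %1, %0" : "=r" (ret), "+m" (*slot));
	return ret;
}
```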
On Thu, Dec 06, 2018 at 09:26:24AM -0800, Nadav Amit wrote: > > On Dec 6, 2018, at 2:25 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Thu, Dec 06, 2018 at 12:28:26AM -0800, Nadav Amit wrote: > >> [ +Peter ] > >> > >> So I dug some more (I’m still not done), and found various trivial things > >> (e.g., storing zero extending u32 immediate is shorter for registers, > >> inlining already takes place). > >> > >> *But* there is one thing that may require some attention - patch > >> b59167ac7bafd ("x86/percpu: Fix this_cpu_read()”) set ordering constraints > >> on the VM_ARGS() evaluation. And this patch also imposes, it appears, > >> (unnecessary) constraints on other pieces of code. > >> > >> These constraints are due to the addition of the volatile keyword for > >> this_cpu_read() by the patch. This affects at least 68 functions in my > >> kernel build, some of which are hot (I think), e.g., finish_task_switch(), > >> smp_x86_platform_ipi() and select_idle_sibling(). > >> > >> Peter, perhaps the solution was too big of a hammer? Is it possible instead > >> to create a separate "this_cpu_read_once()” with the volatile keyword? Such > >> a function can be used for native_sched_clock() and other seqlocks, etc. > > > > No. like the commit writes this_cpu_read() _must_ imply READ_ONCE(). If > > you want something else, use something else, there's plenty other > > options available. > > > > There's this_cpu_op_stable(), but also __this_cpu_read() and > > raw_this_cpu_read() (which currently don't differ from this_cpu_read() > > but could). > > Would setting the inline assembly memory operand both as input and output be > better than using the “volatile”? I don't know.. I'm forever befuddled by the exact semantics of gcc inline asm. > I think that If you do that, the compiler would should the this_cpu_read() > as something that changes the per-cpu-variable, which would make it invalid > to re-read the value. At the same time, it would not prevent reordering the > read with other stuff. So the thing is; as I wrote, the generic version of this_cpu_*() is: local_irq_save(); __this_cpu_*(); local_irq_restore(); And per local_irq_{save,restore}() including compiler barriers that cannot be reordered around either. And per the principle of least surprise, I think our primitives should have similar semantics. I'm actually having difficulty finding the this_cpu_read() in any of the functions you mention, so I cannot make any concrete suggestions other than pointing at the alternative functions available.
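For reference, the generic (non-x86) shape Peter describes looks roughly like the sketch below. This is paraphrased from his description rather than copied from asm-generic, so the in-tree definition may differ in detail: the IRQ save/restore pair already implies compiler barriers, and the argument is that the x86 this_cpu_*() ops should not end up weaker than that.

```c
#define this_cpu_generic_read(pcp)					\
({									\
	typeof(pcp) __ret;						\
	unsigned long __flags;						\
									\
	local_irq_save(__flags);	/* compiler barrier */		\
	__ret = *raw_cpu_ptr(&(pcp));	/* the actual access */		\
	local_irq_restore(__flags);	/* compiler barrier */		\
	__ret;								\
})
```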
[ We can start a new thread, since I have the tendency to hijack threads. ] > On Dec 7, 2018, at 12:45 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Thu, Dec 06, 2018 at 09:26:24AM -0800, Nadav Amit wrote: >>> On Dec 6, 2018, at 2:25 AM, Peter Zijlstra <peterz@infradead.org> wrote: >>> >>> On Thu, Dec 06, 2018 at 12:28:26AM -0800, Nadav Amit wrote: >>>> [ +Peter ] >>>> >>>> So I dug some more (I’m still not done), and found various trivial things >>>> (e.g., storing zero extending u32 immediate is shorter for registers, >>>> inlining already takes place). >>>> >>>> *But* there is one thing that may require some attention - patch >>>> b59167ac7bafd ("x86/percpu: Fix this_cpu_read()”) set ordering constraints >>>> on the VM_ARGS() evaluation. And this patch also imposes, it appears, >>>> (unnecessary) constraints on other pieces of code. >>>> >>>> These constraints are due to the addition of the volatile keyword for >>>> this_cpu_read() by the patch. This affects at least 68 functions in my >>>> kernel build, some of which are hot (I think), e.g., finish_task_switch(), >>>> smp_x86_platform_ipi() and select_idle_sibling(). >>>> >>>> Peter, perhaps the solution was too big of a hammer? Is it possible instead >>>> to create a separate "this_cpu_read_once()” with the volatile keyword? Such >>>> a function can be used for native_sched_clock() and other seqlocks, etc. >>> >>> No. like the commit writes this_cpu_read() _must_ imply READ_ONCE(). If >>> you want something else, use something else, there's plenty other >>> options available. >>> >>> There's this_cpu_op_stable(), but also __this_cpu_read() and >>> raw_this_cpu_read() (which currently don't differ from this_cpu_read() >>> but could). >> >> Would setting the inline assembly memory operand both as input and output be >> better than using the “volatile”? > > I don't know.. I'm forever befuddled by the exact semantics of gcc > inline asm. > >> I think that If you do that, the compiler would should the this_cpu_read() >> as something that changes the per-cpu-variable, which would make it invalid >> to re-read the value. At the same time, it would not prevent reordering the >> read with other stuff. > > So the thing is; as I wrote, the generic version of this_cpu_*() is: > > local_irq_save(); > __this_cpu_*(); > local_irq_restore(); > > And per local_irq_{save,restore}() including compiler barriers that > cannot be reordered around either. > > And per the principle of least surprise, I think our primitives should > have similar semantics. I guess so, but as you’ll see below, the end result is ugly. > I'm actually having difficulty finding the this_cpu_read() in any of the > functions you mention, so I cannot make any concrete suggestions other > than pointing at the alternative functions available. So I got deeper into the code to understand a couple of differences. In the case of select_idle_sibling(), the patch (Peter’s) increase the function code size by 123 bytes (over the baseline of 986). The per-cpu variable is called through the following call chain: select_idle_sibling() => select_idle_cpu() => local_clock() => raw_smp_processor_id() And results in 2 more calls to sched_clock_cpu(), as the compiler assumes the processor id changes in between (which obviously wouldn’t happen). There may be more changes around, which I didn’t fully analyze. But the very least reading the processor id should not get “volatile”. As for finish_task_switch(), the impact is only few bytes, but still unnecessary. 
It appears that with your patch preempt_count() causes multiple reads of __preempt_count in this code:

	if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
		      "corrupted preempt_count: %s/%d/0x%x\n",
		      current->comm, current->pid, preempt_count()))
		preempt_count_set(FORK_PREEMPT_COUNT);

Again, this is unwarranted, as the preemption count should not be changed in any interrupt.
[Resend, changing title & adding lkml and some others ] On Dec 7, 2018, at 3:12 PM, Nadav Amit <nadav.amit@gmail.com> wrote: [ We can start a new thread, since I have the tendency to hijack threads. ] > On Dec 7, 2018, at 12:45 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Thu, Dec 06, 2018 at 09:26:24AM -0800, Nadav Amit wrote: >>> On Dec 6, 2018, at 2:25 AM, Peter Zijlstra <peterz@infradead.org> wrote: >>> >>> On Thu, Dec 06, 2018 at 12:28:26AM -0800, Nadav Amit wrote: >>>> [ +Peter ] >>>> [snip] >>>> >>>> *But* there is one thing that may require some attention - patch >>>> b59167ac7bafd ("x86/percpu: Fix this_cpu_read()”) set ordering constraints >>>> on the VM_ARGS() evaluation. And this patch also imposes, it appears, >>>> (unnecessary) constraints on other pieces of code. >>>> >>>> These constraints are due to the addition of the volatile keyword for >>>> this_cpu_read() by the patch. This affects at least 68 functions in my >>>> kernel build, some of which are hot (I think), e.g., finish_task_switch(), >>>> smp_x86_platform_ipi() and select_idle_sibling(). >>>> >>>> Peter, perhaps the solution was too big of a hammer? Is it possible instead >>>> to create a separate "this_cpu_read_once()” with the volatile keyword? Such >>>> a function can be used for native_sched_clock() and other seqlocks, etc. >>> >>> No. like the commit writes this_cpu_read() _must_ imply READ_ONCE(). If >>> you want something else, use something else, there's plenty other >>> options available. >>> >>> There's this_cpu_op_stable(), but also __this_cpu_read() and >>> raw_this_cpu_read() (which currently don't differ from this_cpu_read() >>> but could). >> >> Would setting the inline assembly memory operand both as input and output be >> better than using the “volatile”? > > I don't know.. I'm forever befuddled by the exact semantics of gcc > inline asm. > >> I think that If you do that, the compiler would should the this_cpu_read() >> as something that changes the per-cpu-variable, which would make it invalid >> to re-read the value. At the same time, it would not prevent reordering the >> read with other stuff. > > So the thing is; as I wrote, the generic version of this_cpu_*() is: > > local_irq_save(); > __this_cpu_*(); > local_irq_restore(); > > And per local_irq_{save,restore}() including compiler barriers that > cannot be reordered around either. > > And per the principle of least surprise, I think our primitives should > have similar semantics. I guess so, but as you’ll see below, the end result is ugly. > I'm actually having difficulty finding the this_cpu_read() in any of the > functions you mention, so I cannot make any concrete suggestions other > than pointing at the alternative functions available. So I got deeper into the code to understand a couple of differences. In the case of select_idle_sibling(), the patch (Peter’s) increase the function code size by 123 bytes (over the baseline of 986). The per-cpu variable is called through the following call chain: select_idle_sibling() => select_idle_cpu() => local_clock() => raw_smp_processor_id() And results in 2 more calls to sched_clock_cpu(), as the compiler assumes the processor id changes in between (which obviously wouldn’t happen). There may be more changes around, which I didn’t fully analyze. But the very least reading the processor id should not get “volatile”. As for finish_task_switch(), the impact is only few bytes, but still unnecessary. 
It appears that with your patch preempt_count() causes multiple reads of __preempt_count in this code:

	if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
		      "corrupted preempt_count: %s/%d/0x%x\n",
		      current->comm, current->pid, preempt_count()))
		preempt_count_set(FORK_PREEMPT_COUNT);

Again, this is unwarranted, as the preemption count should not be changed in any interrupt.
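Not a proposal from the thread, just to make the point concrete: caching the count in a local collapses the snippet above to a single read of __preempt_count regardless of how preempt_count() is defined.

```c
	int pc = preempt_count();

	if (WARN_ONCE(pc != 2*PREEMPT_DISABLE_OFFSET,
		      "corrupted preempt_count: %s/%d/0x%x\n",
		      current->comm, current->pid, pc))
		preempt_count_set(FORK_PREEMPT_COUNT);
```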
On Fri, Dec 07, 2018 at 04:40:52PM -0800, Nadav Amit wrote: > > I'm actually having difficulty finding the this_cpu_read() in any of the > > functions you mention, so I cannot make any concrete suggestions other > > than pointing at the alternative functions available. > > > So I got deeper into the code to understand a couple of differences. In the > case of select_idle_sibling(), the patch (Peter’s) increase the function > code size by 123 bytes (over the baseline of 986). The per-cpu variable is > called through the following call chain: > > select_idle_sibling() > => select_idle_cpu() > => local_clock() > => raw_smp_processor_id() > > And results in 2 more calls to sched_clock_cpu(), as the compiler assumes > the processor id changes in between (which obviously wouldn’t happen). That is the thing with raw_smp_processor_id(), it is allowed to be used in preemptible context, and there it _obviously_ can change between subsequent invocations. So again, this change is actually good. If we want to fix select_idle_cpu(), we should maybe not use local_clock() there but use sched_clock_cpu() with a stable argument, this code runs with IRQs disabled and therefore the CPU number is stable for us here. > There may be more changes around, which I didn’t fully analyze. But > the very least reading the processor id should not get “volatile”. > > As for finish_task_switch(), the impact is only few bytes, but still > unnecessary. It appears that with your patch preempt_count() causes multiple > reads of __preempt_count in this code: > > if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET, > "corrupted preempt_count: %s/%d/0x%x\n", > current->comm, current->pid, preempt_count())) > preempt_count_set(FORK_PREEMPT_COUNT); My patch proposed here: https://marc.info/?l=linux-mm&m=154409548410209 would actually fix that one I think, preempt_count() uses raw_cpu_read_4() which will loose the volatile with that patch.
> On Dec 8, 2018, at 2:52 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Dec 07, 2018 at 04:40:52PM -0800, Nadav Amit wrote: > >>> I'm actually having difficulty finding the this_cpu_read() in any of the >>> functions you mention, so I cannot make any concrete suggestions other >>> than pointing at the alternative functions available. >> >> >> So I got deeper into the code to understand a couple of differences. In the >> case of select_idle_sibling(), the patch (Peter’s) increase the function >> code size by 123 bytes (over the baseline of 986). The per-cpu variable is >> called through the following call chain: >> >> select_idle_sibling() >> => select_idle_cpu() >> => local_clock() >> => raw_smp_processor_id() >> >> And results in 2 more calls to sched_clock_cpu(), as the compiler assumes >> the processor id changes in between (which obviously wouldn’t happen). > > That is the thing with raw_smp_processor_id(), it is allowed to be used > in preemptible context, and there it _obviously_ can change between > subsequent invocations. > > So again, this change is actually good. > > If we want to fix select_idle_cpu(), we should maybe not use > local_clock() there but use sched_clock_cpu() with a stable argument, > this code runs with IRQs disabled and therefore the CPU number is stable > for us here. > >> There may be more changes around, which I didn’t fully analyze. But >> the very least reading the processor id should not get “volatile”. >> >> As for finish_task_switch(), the impact is only few bytes, but still >> unnecessary. It appears that with your patch preempt_count() causes multiple >> reads of __preempt_count in this code: >> >> if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET, >> "corrupted preempt_count: %s/%d/0x%x\n", >> current->comm, current->pid, preempt_count())) >> preempt_count_set(FORK_PREEMPT_COUNT); > > My patch proposed here: > > https://marc.info/?l=linux-mm&m=154409548410209 > > would actually fix that one I think, preempt_count() uses > raw_cpu_read_4() which will loose the volatile with that patch. Sorry for the spam from yesterday. That what happens when I try to write emails on my phone while I’m distracted. I tested the patch you referenced, and it certainly improves the situation for reads, but there are still small and big issues lying around. The biggest one is that (I think) smp_processor_id() should apparently use __this_cpu_read(). Anyhow, when preemption checks are on, it is validated that smp_processor_id() is not used while preemption is enabled, and IRQs are not supposed to change its value. Otherwise the generated code of many functions is affected. There are all kind of other smaller issues, such as set_irq_regs() and get_irq_regs(), which should run with disabled interrupts. They affect the generated code in do_IRQ() and others. But beyond that, there are so many places in the code that use this_cpu_read() while IRQs are guaranteed to be disabled. For example arch/x86/mm/tlb.c is full with this_cpu_read/write() and almost(?) all should be running with interrupts disabled. Having said that, in my build only flush_tlb_func_common() was affected.
On Sun, Dec 09, 2018 at 04:57:43PM -0800, Nadav Amit wrote: > > On Dec 8, 2018, at 2:52 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > My patch proposed here: > > > > https://marc.info/?l=linux-mm&m=154409548410209 > > > > would actually fix that one I think, preempt_count() uses > > raw_cpu_read_4() which will loose the volatile with that patch. > I tested the patch you referenced, and it certainly improves the situation > for reads, but there are still small and big issues lying around. I'm sure :-(, this has been 'festering' for a long while it seems. And esp. on x86 specific code, where for a long time we all assumed the various per-cpu APIs were in fact the same (which turns out to very much not be true). > The biggest one is that (I think) smp_processor_id() should apparently use > __this_cpu_read(). Agreed, and note that this will also improve code generation on !x86. However, I'm not sure the current !debug definition: #define smp_processor_id() raw_smp_processor_id() is actually correct. Where raw_smp_processor_id() must be this_cpu_read() to avoid CSE, we actually want to allow CSE on smp_processor_id() etc.. > There are all kind of other smaller issues, such as set_irq_regs() and > get_irq_regs(), which should run with disabled interrupts. They affect the > generated code in do_IRQ() and others. > > But beyond that, there are so many places in the code that use > this_cpu_read() while IRQs are guaranteed to be disabled. For example > arch/x86/mm/tlb.c is full with this_cpu_read/write() and almost(?) all > should be running with interrupts disabled. Having said that, in my build > only flush_tlb_func_common() was affected. This all feels like something static analysis could help with; such tools would also make sense for !x86 where the difference between the various per-cpu accessors is even bigger.
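A sketch of the split Peter is hinting at (cpu_number is the existing x86 per-CPU variable; the __smp_processor_id() define is illustrative, not something this thread adds): keep the raw accessor as a full this_cpu_read() so preemptible callers always re-read it, and give contexts that have already disabled preemption a variant the compiler is allowed to CSE.

```c
/* must be re-read on every use: callers may be preemptible */
#define raw_smp_processor_id()	this_cpu_read(cpu_number)

/* hypothetical CSE-friendly variant for preemption-off contexts */
#define __smp_processor_id()	__this_cpu_read(cpu_number)

#ifdef CONFIG_DEBUG_PREEMPT
# define smp_processor_id()	debug_smp_processor_id()
#else
# define smp_processor_id()	__smp_processor_id()
#endif
```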
> On Dec 10, 2018, at 12:55 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Sun, Dec 09, 2018 at 04:57:43PM -0800, Nadav Amit wrote: >>> On Dec 8, 2018, at 2:52 AM, Peter Zijlstra <peterz@infradead.org> wrote: > >>> My patch proposed here: >>> >>> https://marc.info/?l=linux-mm&m=154409548410209 >>> >>> would actually fix that one I think, preempt_count() uses >>> raw_cpu_read_4() which will loose the volatile with that patch. > >> I tested the patch you referenced, and it certainly improves the situation >> for reads, but there are still small and big issues lying around. > > I'm sure :-(, this has been 'festering' for a long while it seems. And > esp. on x86 specific code, where for a long time we all assumed the > various per-cpu APIs were in fact the same (which turns out to very much > not be true). > >> The biggest one is that (I think) smp_processor_id() should apparently use >> __this_cpu_read(). > > Agreed, and note that this will also improve code generation on !x86. > > However, I'm not sure the current !debug definition: > > #define smp_processor_id() raw_smp_processor_id() > > is actually correct. Where raw_smp_processor_id() must be > this_cpu_read() to avoid CSE, we actually want to allow CSE on > smp_processor_id() etc.. Yes. That makes sense. > >> There are all kind of other smaller issues, such as set_irq_regs() and >> get_irq_regs(), which should run with disabled interrupts. They affect the >> generated code in do_IRQ() and others. >> >> But beyond that, there are so many places in the code that use >> this_cpu_read() while IRQs are guaranteed to be disabled. For example >> arch/x86/mm/tlb.c is full with this_cpu_read/write() and almost(?) all >> should be running with interrupts disabled. Having said that, in my build >> only flush_tlb_func_common() was affected. > > This all feels like something static analysis could help with; such > tools would also make sense for !x86 where the difference between the > various per-cpu accessors is even bigger. If something like that existed, it could also allow to get rid of local_irq_save() (and use local_irq_disable() instead).
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 97d4b25d0373..3bd9b1bcb702 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -395,13 +395,26 @@ static void purge_vmap_area_lazy(void); static BLOCKING_NOTIFIER_HEAD(vmap_notify_list); +struct vm_args { + unsigned long size; + unsigned long align; + unsigned long start; + unsigned long end; +}; + +#define VM_ARGS(name, _size, _align, _start, _end) \ + struct vm_args name = { \ + .size = (_size), \ + .align = (_align), \ + .start = (_start), \ + .end = (_end), \ + } + /* * Allocate a region of KVA of the specified size and alignment, within the - * vstart and vend. + * args->start and args->end. */ -static struct vmap_area *alloc_vmap_area(unsigned long size, - unsigned long align, - unsigned long vstart, unsigned long vend, +static struct vmap_area *alloc_vmap_area(struct vm_args *args, int node, gfp_t gfp_mask) { struct vmap_area *va; @@ -409,10 +422,11 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, unsigned long addr; int purged = 0; struct vmap_area *first; + unsigned long size = args->size; BUG_ON(!size); BUG_ON(offset_in_page(size)); - BUG_ON(!is_power_of_2(align)); + BUG_ON(!is_power_of_2(args->align)); might_sleep(); @@ -433,34 +447,34 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, * Invalidate cache if we have more permissive parameters. * cached_hole_size notes the largest hole noticed _below_ * the vmap_area cached in free_vmap_cache: if size fits - * into that hole, we want to scan from vstart to reuse + * into that hole, we want to scan from args->start to reuse * the hole instead of allocating above free_vmap_cache. * Note that __free_vmap_area may update free_vmap_cache * without updating cached_hole_size or cached_align. */ if (!free_vmap_cache || size < cached_hole_size || - vstart < cached_vstart || - align < cached_align) { + args->start < cached_vstart || + args->align < cached_align) { nocache: cached_hole_size = 0; free_vmap_cache = NULL; } /* record if we encounter less permissive parameters */ - cached_vstart = vstart; - cached_align = align; + cached_vstart = args->start; + cached_align = args->align; /* find starting point for our search */ if (free_vmap_cache) { first = rb_entry(free_vmap_cache, struct vmap_area, rb_node); - addr = ALIGN(first->va_end, align); - if (addr < vstart) + addr = ALIGN(first->va_end, args->align); + if (addr < args->start) goto nocache; if (addr + size < addr) goto overflow; } else { - addr = ALIGN(vstart, align); + addr = ALIGN(args->start, args->align); if (addr + size < addr) goto overflow; @@ -484,10 +498,10 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, } /* from the starting point, walk areas until a suitable hole is found */ - while (addr + size > first->va_start && addr + size <= vend) { + while (addr + size > first->va_start && addr + size <= args->end) { if (addr + cached_hole_size < first->va_start) cached_hole_size = first->va_start - addr; - addr = ALIGN(first->va_end, align); + addr = ALIGN(first->va_end, args->align); if (addr + size < addr) goto overflow; @@ -498,7 +512,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, } found: - if (addr + size > vend) + if (addr + size > args->end) goto overflow; va->va_start = addr; @@ -508,9 +522,9 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, free_vmap_cache = &va->rb_node; spin_unlock(&vmap_area_lock); - BUG_ON(!IS_ALIGNED(va->va_start, align)); - BUG_ON(va->va_start < vstart); - BUG_ON(va->va_end > vend); + BUG_ON(!IS_ALIGNED(va->va_start, args->align)); 
+ BUG_ON(va->va_start < args->start); + BUG_ON(va->va_end > args->end); return va; @@ -844,6 +858,8 @@ static void *vmap_block_vaddr(unsigned long va_start, unsigned long pages_off) */ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask) { + VM_ARGS(args, VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE, + VMALLOC_START, VMALLOC_END); struct vmap_block_queue *vbq; struct vmap_block *vb; struct vmap_area *va; @@ -858,9 +874,7 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask) if (unlikely(!vb)) return ERR_PTR(-ENOMEM); - va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE, - VMALLOC_START, VMALLOC_END, - node, gfp_mask); + va = alloc_vmap_area(&args, node, gfp_mask); if (IS_ERR(va)) { kfree(vb); return ERR_CAST(va); @@ -1169,9 +1183,9 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t pro return NULL; addr = (unsigned long)mem; } else { - struct vmap_area *va; - va = alloc_vmap_area(size, PAGE_SIZE, - VMALLOC_START, VMALLOC_END, node, GFP_KERNEL); + VM_ARGS(args, size, PAGE_SIZE, VMALLOC_START, VMALLOC_END); + struct vmap_area *va = alloc_vmap_area(&args, node, GFP_KERNEL); + if (IS_ERR(va)) return NULL; @@ -1370,56 +1384,57 @@ static void clear_vm_uninitialized_flag(struct vm_struct *vm) vm->flags &= ~VM_UNINITIALIZED; } -static struct vm_struct *__get_vm_area_node(unsigned long size, - unsigned long align, unsigned long flags, unsigned long start, - unsigned long end, int node, gfp_t gfp_mask, const void *caller) +static struct vm_struct *__get_vm_area_node(struct vm_args *args, int node, + gfp_t gfp, unsigned long vm_flags, const void *caller) { struct vmap_area *va; struct vm_struct *area; + unsigned long size; BUG_ON(in_interrupt()); - size = PAGE_ALIGN(size); + size = PAGE_ALIGN(args->size); if (unlikely(!size)) return NULL; - if (flags & VM_IOREMAP) - align = 1ul << clamp_t(int, get_count_order_long(size), + if (vm_flags & VM_IOREMAP) + args->align = 1ul << clamp_t(int, get_count_order_long(size), PAGE_SHIFT, IOREMAP_MAX_ORDER); - area = kzalloc_node(sizeof(*area), gfp_mask & GFP_RECLAIM_MASK, node); + area = kzalloc_node(sizeof(*area), gfp & GFP_RECLAIM_MASK, node); if (unlikely(!area)) return NULL; - if (!(flags & VM_NO_GUARD)) + if (!(vm_flags & VM_NO_GUARD)) size += PAGE_SIZE; + args->size = size; - va = alloc_vmap_area(size, align, start, end, node, gfp_mask); + va = alloc_vmap_area(args, node, gfp); if (IS_ERR(va)) { kfree(area); return NULL; } - setup_vmalloc_vm(area, va, flags, caller); + setup_vmalloc_vm(area, va, vm_flags, caller); return area; } -struct vm_struct *__get_vm_area(unsigned long size, unsigned long flags, - unsigned long start, unsigned long end) -{ - return __get_vm_area_node(size, 1, flags, start, end, NUMA_NO_NODE, - GFP_KERNEL, __builtin_return_address(0)); -} -EXPORT_SYMBOL_GPL(__get_vm_area); - struct vm_struct *__get_vm_area_caller(unsigned long size, unsigned long flags, unsigned long start, unsigned long end, const void *caller) { - return __get_vm_area_node(size, 1, flags, start, end, NUMA_NO_NODE, - GFP_KERNEL, caller); + VM_ARGS(args, size, 1, start, end); + return __get_vm_area_node(&args, NUMA_NO_NODE, GFP_KERNEL, flags, caller); } +struct vm_struct *__get_vm_area(unsigned long size, unsigned long flags, + unsigned long start, unsigned long end) +{ + return __get_vm_area_caller(size, flags, start, end, + __builtin_return_address(0)); +} +EXPORT_SYMBOL_GPL(__get_vm_area); + /** * get_vm_area - reserve a contiguous kernel virtual area * @size: size of the area @@ -1431,16 +1446,14 @@ struct vm_struct 
*__get_vm_area_caller(unsigned long size, unsigned long flags, */ struct vm_struct *get_vm_area(unsigned long size, unsigned long flags) { - return __get_vm_area_node(size, 1, flags, VMALLOC_START, VMALLOC_END, - NUMA_NO_NODE, GFP_KERNEL, - __builtin_return_address(0)); + return __get_vm_area(size, flags, VMALLOC_START, VMALLOC_END); } struct vm_struct *get_vm_area_caller(unsigned long size, unsigned long flags, const void *caller) { - return __get_vm_area_node(size, 1, flags, VMALLOC_START, VMALLOC_END, - NUMA_NO_NODE, GFP_KERNEL, caller); + return __get_vm_area_caller(size, flags, VMALLOC_START, VMALLOC_END, + caller); } /** @@ -1734,6 +1747,7 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, pgprot_t prot, unsigned long vm_flags, int node, const void *caller) { + VM_ARGS(args, size, align, start, end); struct vm_struct *area; void *addr; unsigned long real_size = size; @@ -1741,9 +1755,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, size = PAGE_ALIGN(size); if (!size || (size >> PAGE_SHIFT) > totalram_pages) goto fail; + args.size = size; - area = __get_vm_area_node(size, align, VM_ALLOC | VM_UNINITIALIZED | - vm_flags, start, end, node, gfp_mask, caller); + area = __get_vm_area_node(&args, node, gfp_mask, + vm_flags | VM_ALLOC | VM_UNINITIALIZED, caller); if (!area) goto fail;
Some of the functions in vmalloc.c have as many as nine arguments. So I thought I'd have a quick go at bundling the ones that make sense into a struct and passing around a pointer to that struct. Well, it made the generated code worse, so I thought I'd share my attempt so nobody else bothers (or somebody points out that I did something stupid).

I tried a few variations on this theme; bundling gfp_t and node into the struct made it even worse, as did adding caller and vm_flags. This is the least bad version.

(Yes, the naming is bad; I'm not tidying this up for submission, I'm showing an experiment that didn't work).

Nacked-by: Matthew Wilcox <willy@infradead.org>
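For readers who skip the diff, this is the call-site change the experiment boils down to (lifted from the vm_map_ram() hunk above; nothing new added here):

```c
/* before: six positional arguments */
va = alloc_vmap_area(size, PAGE_SIZE, VMALLOC_START, VMALLOC_END,
		     node, GFP_KERNEL);

/* after: range, size and alignment bundled into a struct on the stack */
VM_ARGS(args, size, PAGE_SIZE, VMALLOC_START, VMALLOC_END);
va = alloc_vmap_area(&args, node, GFP_KERNEL);
```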