Number of arguments in vmalloc.c

Message ID	20181128140136.GG10377@bombadil.infradead.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <owner-linux-mm@kvack.org> Received-SPF: pass (google.com: best guess record for domain of willy@infradead.org designates 2607:7c80:54:e::133 as permitted sender) client-ip=2607:7c80:54:e::133; Date: Wed, 28 Nov 2018 06:01:36 -0800 From: Matthew Wilcox <willy@infradead.org> To: linux-mm@kvack.org Subject: Number of arguments in vmalloc.c Message-ID: <20181128140136.GG10377@bombadil.infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.9.2 (2017-12-15) Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	Number of arguments in vmalloc.c \| expand Number of arguments in vmalloc.c

Matthew Wilcox Nov. 28, 2018, 2:01 p.m. UTC

Some of the functions in vmalloc.c have as many as nine arguments.
So I thought I'd have a quick go at bundling the ones that make sense
into a struct and pass around a pointer to that struct.  Well, it made
the generated code worse, so I thought I'd share my attempt so nobody
else bothers (or soebody points out that I did something stupid).

I tried a few variations on this theme; bundling gfp_t and node into
the struct made it even worse, as did adding caller and vm_flags.  This
is the least bad version.

(Yes, the naming is bad; I'm not tidying this up for submission, I'm
showing an experiment that didn't work).

Nacked-by: Matthew Wilcox <willy@infradead.org>

Vlastimil Babka Dec. 3, 2018, 1:59 p.m. UTC | #1

On 11/28/18 3:01 PM, Matthew Wilcox wrote:
> 
> Some of the functions in vmalloc.c have as many as nine arguments.
> So I thought I'd have a quick go at bundling the ones that make sense
> into a struct and pass around a pointer to that struct.  Well, it made
> the generated code worse,

Worse in which metric?

> so I thought I'd share my attempt so nobody
> else bothers (or soebody points out that I did something stupid).

I guess in some of the functions the args parameter could be const?
Might make some difference.

Anyway this shouldn't be a fast path, so even if the generated code is
e.g. somewhat larger, then it still might make sense to reduce the
insane parameter lists.

> I tried a few variations on this theme; bundling gfp_t and node into
> the struct made it even worse, as did adding caller and vm_flags.  This
> is the least bad version.
> 
> (Yes, the naming is bad; I'm not tidying this up for submission, I'm
> showing an experiment that didn't work).
> 
> Nacked-by: Matthew Wilcox <willy@infradead.org>
>

Matthew Wilcox Dec. 3, 2018, 4:13 p.m. UTC | #2

On Mon, Dec 03, 2018 at 02:59:36PM +0100, Vlastimil Babka wrote:
> On 11/28/18 3:01 PM, Matthew Wilcox wrote:
> > 
> > Some of the functions in vmalloc.c have as many as nine arguments.
> > So I thought I'd have a quick go at bundling the ones that make sense
> > into a struct and pass around a pointer to that struct.  Well, it made
> > the generated code worse,
> 
> Worse in which metric?

More instructions to accomplish the same thing.

> > so I thought I'd share my attempt so nobody
> > else bothers (or soebody points out that I did something stupid).
> 
> I guess in some of the functions the args parameter could be const?
> Might make some difference.
> 
> Anyway this shouldn't be a fast path, so even if the generated code is
> e.g. somewhat larger, then it still might make sense to reduce the
> insane parameter lists.

It might ... I'm not sure it's even easier to program than the original
though.

Nadav Amit Dec. 3, 2018, 10:04 p.m. UTC | #3

> On Dec 3, 2018, at 8:13 AM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Mon, Dec 03, 2018 at 02:59:36PM +0100, Vlastimil Babka wrote:
>> On 11/28/18 3:01 PM, Matthew Wilcox wrote:
>>> Some of the functions in vmalloc.c have as many as nine arguments.
>>> So I thought I'd have a quick go at bundling the ones that make sense
>>> into a struct and pass around a pointer to that struct.  Well, it made
>>> the generated code worse,
>> 
>> Worse in which metric?
> 
> More instructions to accomplish the same thing.
> 
>>> so I thought I'd share my attempt so nobody
>>> else bothers (or soebody points out that I did something stupid).
>> 
>> I guess in some of the functions the args parameter could be const?
>> Might make some difference.
>> 
>> Anyway this shouldn't be a fast path, so even if the generated code is
>> e.g. somewhat larger, then it still might make sense to reduce the
>> insane parameter lists.
> 
> It might ... I'm not sure it's even easier to program than the original
> though.

My intuition is that if all the fields of vm_args were initialized together
(in the same function), and a 'const struct vm_args *' was provided as
an argument to other functions, code would be better (at least better than
what you got right now).

I’m not saying it is easily applicable in this use-case (since I didn’t
check).

Matthew Wilcox Dec. 3, 2018, 10:49 p.m. UTC | #4

On Mon, Dec 03, 2018 at 02:04:41PM -0800, Nadav Amit wrote:
> On Dec 3, 2018, at 8:13 AM, Matthew Wilcox <willy@infradead.org> wrote:
> > On Mon, Dec 03, 2018 at 02:59:36PM +0100, Vlastimil Babka wrote:
> >> On 11/28/18 3:01 PM, Matthew Wilcox wrote:
> >>> Some of the functions in vmalloc.c have as many as nine arguments.
> >>> So I thought I'd have a quick go at bundling the ones that make sense
> >>> into a struct and pass around a pointer to that struct.  Well, it made
> >>> the generated code worse,
> >> 
> >> Worse in which metric?
> > 
> > More instructions to accomplish the same thing.
> > 
> >>> so I thought I'd share my attempt so nobody
> >>> else bothers (or soebody points out that I did something stupid).
> >> 
> >> I guess in some of the functions the args parameter could be const?
> >> Might make some difference.
> >> 
> >> Anyway this shouldn't be a fast path, so even if the generated code is
> >> e.g. somewhat larger, then it still might make sense to reduce the
> >> insane parameter lists.
> > 
> > It might ... I'm not sure it's even easier to program than the original
> > though.
> 
> My intuition is that if all the fields of vm_args were initialized together
> (in the same function), and a 'const struct vm_args *' was provided as
> an argument to other functions, code would be better (at least better than
> what you got right now).
> 
> I’m not saying it is easily applicable in this use-case (since I didn’t
> check).

Your intuition is wrong ...

   text	   data	    bss	    dec	    hex	filename
   9466	     81	     32	   9579	   256b	before.o
   9546	     81	     32	   9659	   25bb	.build-tiny/mm/vmalloc.o
   9546	     81	     32	   9659	   25bb	const.o

indeed, there's no difference between with or without the const, according
to 'cmp'.

Now, only alloc_vmap_area() gets to take a const argument.
__get_vm_area_node() intentionally modifies the arguments.  But feel
free to play around with this; you might be able to make it do something
worthwhile.

Nadav Amit Dec. 4, 2018, 3:12 a.m. UTC | #5

> On Dec 3, 2018, at 2:49 PM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Mon, Dec 03, 2018 at 02:04:41PM -0800, Nadav Amit wrote:
>> On Dec 3, 2018, at 8:13 AM, Matthew Wilcox <willy@infradead.org> wrote:
>>> On Mon, Dec 03, 2018 at 02:59:36PM +0100, Vlastimil Babka wrote:
>>>> On 11/28/18 3:01 PM, Matthew Wilcox wrote:
>>>>> Some of the functions in vmalloc.c have as many as nine arguments.
>>>>> So I thought I'd have a quick go at bundling the ones that make sense
>>>>> into a struct and pass around a pointer to that struct.  Well, it made
>>>>> the generated code worse,
>>>> 
>>>> Worse in which metric?
>>> 
>>> More instructions to accomplish the same thing.
>>> 
>>>>> so I thought I'd share my attempt so nobody
>>>>> else bothers (or soebody points out that I did something stupid).
>>>> 
>>>> I guess in some of the functions the args parameter could be const?
>>>> Might make some difference.
>>>> 
>>>> Anyway this shouldn't be a fast path, so even if the generated code is
>>>> e.g. somewhat larger, then it still might make sense to reduce the
>>>> insane parameter lists.
>>> 
>>> It might ... I'm not sure it's even easier to program than the original
>>> though.
>> 
>> My intuition is that if all the fields of vm_args were initialized together
>> (in the same function), and a 'const struct vm_args *' was provided as
>> an argument to other functions, code would be better (at least better than
>> what you got right now).
>> 
>> I’m not saying it is easily applicable in this use-case (since I didn’t
>> check).
> 
> Your intuition is wrong ...
> 
>   text	   data	    bss	    dec	    hex	filename
>   9466	     81	     32	   9579	   256b	before.o
>   9546	     81	     32	   9659	   25bb	.build-tiny/mm/vmalloc.o
>   9546	     81	     32	   9659	   25bb	const.o
> 
> indeed, there's no difference between with or without the const, according
> to 'cmp'.
> 
> Now, only alloc_vmap_area() gets to take a const argument.
> __get_vm_area_node() intentionally modifies the arguments.  But feel
> free to play around with this; you might be able to make it do something
> worthwhile.

I was playing with it (a bit). What I suggested (modifying
__get_vm_area_node() so it will not change arguments) helps a bit, but not
much.

One insight that I got is that at least part of the overhead comes from the
the stack protector code that gcc emits.

Nadav Amit Dec. 6, 2018, 8:28 a.m. UTC | #6

> On Dec 3, 2018, at 7:12 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
>> On Dec 3, 2018, at 2:49 PM, Matthew Wilcox <willy@infradead.org> wrote:
>> 
>> On Mon, Dec 03, 2018 at 02:04:41PM -0800, Nadav Amit wrote:
>>> On Dec 3, 2018, at 8:13 AM, Matthew Wilcox <willy@infradead.org> wrote:
>>>> On Mon, Dec 03, 2018 at 02:59:36PM +0100, Vlastimil Babka wrote:
>>>>> On 11/28/18 3:01 PM, Matthew Wilcox wrote:
>>>>>> Some of the functions in vmalloc.c have as many as nine arguments.
>>>>>> So I thought I'd have a quick go at bundling the ones that make sense
>>>>>> into a struct and pass around a pointer to that struct.  Well, it made
>>>>>> the generated code worse,
>>>>> 
>>>>> Worse in which metric?
>>>> 
>>>> More instructions to accomplish the same thing.
>>>> 
>>>>>> so I thought I'd share my attempt so nobody
>>>>>> else bothers (or soebody points out that I did something stupid).
>>>>> 
>>>>> I guess in some of the functions the args parameter could be const?
>>>>> Might make some difference.
>>>>> 
>>>>> Anyway this shouldn't be a fast path, so even if the generated code is
>>>>> e.g. somewhat larger, then it still might make sense to reduce the
>>>>> insane parameter lists.
>>>> 
>>>> It might ... I'm not sure it's even easier to program than the original
>>>> though.
>>> 
>>> My intuition is that if all the fields of vm_args were initialized together
>>> (in the same function), and a 'const struct vm_args *' was provided as
>>> an argument to other functions, code would be better (at least better than
>>> what you got right now).
>>> 
>>> I’m not saying it is easily applicable in this use-case (since I didn’t
>>> check).
>> 
>> Your intuition is wrong ...
>> 
>>  text	   data	    bss	    dec	    hex	filename
>>  9466	     81	     32	   9579	   256b	before.o
>>  9546	     81	     32	   9659	   25bb	.build-tiny/mm/vmalloc.o
>>  9546	     81	     32	   9659	   25bb	const.o
>> 
>> indeed, there's no difference between with or without the const, according
>> to 'cmp'.
>> 
>> Now, only alloc_vmap_area() gets to take a const argument.
>> __get_vm_area_node() intentionally modifies the arguments.  But feel
>> free to play around with this; you might be able to make it do something
>> worthwhile.
> 
> I was playing with it (a bit). What I suggested (modifying
> __get_vm_area_node() so it will not change arguments) helps a bit, but not
> much.
> 
> One insight that I got is that at least part of the overhead comes from the
> the stack protector code that gcc emits.

[ +Peter ]

So I dug some more (I’m still not done), and found various trivial things
(e.g., storing zero extending u32 immediate is shorter for registers,
inlining already takes place).

*But* there is one thing that may require some attention - patch
b59167ac7bafd ("x86/percpu: Fix this_cpu_read()”) set ordering constraints
on the VM_ARGS() evaluation. And this patch also imposes, it appears,
(unnecessary) constraints on other pieces of code.

These constraints are due to the addition of the volatile keyword for
this_cpu_read() by the patch. This affects at least 68 functions in my
kernel build, some of which are hot (I think), e.g., finish_task_switch(),
smp_x86_platform_ipi() and select_idle_sibling().

Peter, perhaps the solution was too big of a hammer? Is it possible instead
to create a separate "this_cpu_read_once()” with the volatile keyword? Such
a function can be used for native_sched_clock() and other seqlocks, etc.

Peter Zijlstra Dec. 6, 2018, 10:25 a.m. UTC | #7

On Thu, Dec 06, 2018 at 12:28:26AM -0800, Nadav Amit wrote:
> [ +Peter ]
> 
> So I dug some more (I’m still not done), and found various trivial things
> (e.g., storing zero extending u32 immediate is shorter for registers,
> inlining already takes place).
> 
> *But* there is one thing that may require some attention - patch
> b59167ac7bafd ("x86/percpu: Fix this_cpu_read()”) set ordering constraints
> on the VM_ARGS() evaluation. And this patch also imposes, it appears,
> (unnecessary) constraints on other pieces of code.
> 
> These constraints are due to the addition of the volatile keyword for
> this_cpu_read() by the patch. This affects at least 68 functions in my
> kernel build, some of which are hot (I think), e.g., finish_task_switch(),
> smp_x86_platform_ipi() and select_idle_sibling().
> 
> Peter, perhaps the solution was too big of a hammer? Is it possible instead
> to create a separate "this_cpu_read_once()” with the volatile keyword? Such
> a function can be used for native_sched_clock() and other seqlocks, etc.

No. like the commit writes this_cpu_read() _must_ imply READ_ONCE(). If
you want something else, use something else, there's plenty other
options available.

There's this_cpu_op_stable(), but also __this_cpu_read() and
raw_this_cpu_read() (which currently don't differ from this_cpu_read()
but could).

Peter Zijlstra Dec. 6, 2018, 11:24 a.m. UTC | #8

On Thu, Dec 06, 2018 at 11:25:59AM +0100, Peter Zijlstra wrote:
> On Thu, Dec 06, 2018 at 12:28:26AM -0800, Nadav Amit wrote:
> > [ +Peter ]
> > 
> > So I dug some more (I’m still not done), and found various trivial things
> > (e.g., storing zero extending u32 immediate is shorter for registers,
> > inlining already takes place).
> > 
> > *But* there is one thing that may require some attention - patch
> > b59167ac7bafd ("x86/percpu: Fix this_cpu_read()”) set ordering constraints
> > on the VM_ARGS() evaluation. And this patch also imposes, it appears,
> > (unnecessary) constraints on other pieces of code.
> > 
> > These constraints are due to the addition of the volatile keyword for
> > this_cpu_read() by the patch. This affects at least 68 functions in my
> > kernel build, some of which are hot (I think), e.g., finish_task_switch(),
> > smp_x86_platform_ipi() and select_idle_sibling().
> > 
> > Peter, perhaps the solution was too big of a hammer? Is it possible instead
> > to create a separate "this_cpu_read_once()” with the volatile keyword? Such
> > a function can be used for native_sched_clock() and other seqlocks, etc.
> 
> No. like the commit writes this_cpu_read() _must_ imply READ_ONCE(). If
> you want something else, use something else, there's plenty other
> options available.

The thing is, the this_cpu_*() things are defined IRQ-safe, this means
the values are subject to change from IRQs, and thus must be reloaded.

Also (the generic form):

  local_irq_save()
  __this_cpu_read()
  local_irq_restore()

would not be allowed to be munged like that.

Which raises the point that percpu_from_op() and the other also need
that volatile.

__this_cpu_*() OTOH assumes external preempt/IRQ disabling and could
thus be allowed to move crud about.

Which would suggest the following

---
 arch/x86/include/asm/percpu.h | 224 +++++++++++++++++++++---------------------
 1 file changed, 112 insertions(+), 112 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 1a19d11cfbbd..f75ccccd71aa 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -87,7 +87,7 @@
  * don't give an lvalue though). */
 extern void __bad_percpu_size(void);
 
-#define percpu_to_op(op, var, val)			\
+#define percpu_to_op(qual, op, var, val)		\
 do {							\
 	typedef typeof(var) pto_T__;			\
 	if (0) {					\
@@ -97,22 +97,22 @@ do {							\
 	}						\
 	switch (sizeof(var)) {				\
 	case 1:						\
-		asm(op "b %1,"__percpu_arg(0)		\
+		asm qual (op "b %1,"__percpu_arg(0)	\
 		    : "+m" (var)			\
 		    : "qi" ((pto_T__)(val)));		\
 		break;					\
 	case 2:						\
-		asm(op "w %1,"__percpu_arg(0)		\
+		asm qual (op "w %1,"__percpu_arg(0)	\
 		    : "+m" (var)			\
 		    : "ri" ((pto_T__)(val)));		\
 		break;					\
 	case 4:						\
-		asm(op "l %1,"__percpu_arg(0)		\
+		asm qual (op "l %1,"__percpu_arg(0)	\
 		    : "+m" (var)			\
 		    : "ri" ((pto_T__)(val)));		\
 		break;					\
 	case 8:						\
-		asm(op "q %1,"__percpu_arg(0)		\
+		asm qual (op "q %1,"__percpu_arg(0)	\
 		    : "+m" (var)			\
 		    : "re" ((pto_T__)(val)));		\
 		break;					\
@@ -124,7 +124,7 @@ do {							\
  * Generate a percpu add to memory instruction and optimize code
  * if one is added or subtracted.
  */
-#define percpu_add_op(var, val)						\
+#define percpu_add_op(qual, var, val)					\
 do {									\
 	typedef typeof(var) pao_T__;					\
 	const int pao_ID__ = (__builtin_constant_p(val) &&		\
@@ -138,41 +138,41 @@ do {									\
 	switch (sizeof(var)) {						\
 	case 1:								\
 		if (pao_ID__ == 1)					\
-			asm("incb "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("incb "__percpu_arg(0) : "+m" (var));	\
 		else if (pao_ID__ == -1)				\
-			asm("decb "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("decb "__percpu_arg(0) : "+m" (var));	\
 		else							\
-			asm("addb %1, "__percpu_arg(0)			\
+			asm qual ("addb %1, "__percpu_arg(0)		\
 			    : "+m" (var)				\
 			    : "qi" ((pao_T__)(val)));			\
 		break;							\
 	case 2:								\
 		if (pao_ID__ == 1)					\
-			asm("incw "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("incw "__percpu_arg(0) : "+m" (var));	\
 		else if (pao_ID__ == -1)				\
-			asm("decw "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("decw "__percpu_arg(0) : "+m" (var));	\
 		else							\
-			asm("addw %1, "__percpu_arg(0)			\
+			asm qual ("addw %1, "__percpu_arg(0)		\
 			    : "+m" (var)				\
 			    : "ri" ((pao_T__)(val)));			\
 		break;							\
 	case 4:								\
 		if (pao_ID__ == 1)					\
-			asm("incl "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("incl "__percpu_arg(0) : "+m" (var));	\
 		else if (pao_ID__ == -1)				\
-			asm("decl "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("decl "__percpu_arg(0) : "+m" (var));	\
 		else							\
-			asm("addl %1, "__percpu_arg(0)			\
+			asm qual ("addl %1, "__percpu_arg(0)		\
 			    : "+m" (var)				\
 			    : "ri" ((pao_T__)(val)));			\
 		break;							\
 	case 8:								\
 		if (pao_ID__ == 1)					\
-			asm("incq "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("incq "__percpu_arg(0) : "+m" (var));	\
 		else if (pao_ID__ == -1)				\
-			asm("decq "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("decq "__percpu_arg(0) : "+m" (var));	\
 		else							\
-			asm("addq %1, "__percpu_arg(0)			\
+			asm qual ("addq %1, "__percpu_arg(0)		\
 			    : "+m" (var)				\
 			    : "re" ((pao_T__)(val)));			\
 		break;							\
@@ -180,27 +180,27 @@ do {									\
 	}								\
 } while (0)
 
-#define percpu_from_op(op, var)				\
+#define percpu_from_op(qual, op, var)			\
 ({							\
 	typeof(var) pfo_ret__;				\
 	switch (sizeof(var)) {				\
 	case 1:						\
-		asm volatile(op "b "__percpu_arg(1)",%0"\
+		asm qual (op "b "__percpu_arg(1)",%0"	\
 		    : "=q" (pfo_ret__)			\
 		    : "m" (var));			\
 		break;					\
 	case 2:						\
-		asm volatile(op "w "__percpu_arg(1)",%0"\
+		asm qual (op "w "__percpu_arg(1)",%0"	\
 		    : "=r" (pfo_ret__)			\
 		    : "m" (var));			\
 		break;					\
 	case 4:						\
-		asm volatile(op "l "__percpu_arg(1)",%0"\
+		asm qual (op "l "__percpu_arg(1)",%0"	\
 		    : "=r" (pfo_ret__)			\
 		    : "m" (var));			\
 		break;					\
 	case 8:						\
-		asm volatile(op "q "__percpu_arg(1)",%0"\
+		asm qual (op "q "__percpu_arg(1)",%0"	\
 		    : "=r" (pfo_ret__)			\
 		    : "m" (var));			\
 		break;					\
@@ -238,23 +238,23 @@ do {									\
 	pfo_ret__;					\
 })
 
-#define percpu_unary_op(op, var)			\
+#define percpu_unary_op(qual, op, var)			\
 ({							\
 	switch (sizeof(var)) {				\
 	case 1:						\
-		asm(op "b "__percpu_arg(0)		\
+		asm qual (op "b "__percpu_arg(0)	\
 		    : "+m" (var));			\
 		break;					\
 	case 2:						\
-		asm(op "w "__percpu_arg(0)		\
+		asm qual (op "w "__percpu_arg(0)	\
 		    : "+m" (var));			\
 		break;					\
 	case 4:						\
-		asm(op "l "__percpu_arg(0)		\
+		asm qual (op "l "__percpu_arg(0)	\
 		    : "+m" (var));			\
 		break;					\
 	case 8:						\
-		asm(op "q "__percpu_arg(0)		\
+		asm qual (op "q "__percpu_arg(0)	\
 		    : "+m" (var));			\
 		break;					\
 	default: __bad_percpu_size();			\
@@ -264,27 +264,27 @@ do {									\
 /*
  * Add return operation
  */
-#define percpu_add_return_op(var, val)					\
+#define percpu_add_return_op(qual, var, val)				\
 ({									\
 	typeof(var) paro_ret__ = val;					\
 	switch (sizeof(var)) {						\
 	case 1:								\
-		asm("xaddb %0, "__percpu_arg(1)				\
+		asm qual ("xaddb %0, "__percpu_arg(1)			\
 			    : "+q" (paro_ret__), "+m" (var)		\
 			    : : "memory");				\
 		break;							\
 	case 2:								\
-		asm("xaddw %0, "__percpu_arg(1)				\
+		asm qual ("xaddw %0, "__percpu_arg(1)			\
 			    : "+r" (paro_ret__), "+m" (var)		\
 			    : : "memory");				\
 		break;							\
 	case 4:								\
-		asm("xaddl %0, "__percpu_arg(1)				\
+		asm qual ("xaddl %0, "__percpu_arg(1)			\
 			    : "+r" (paro_ret__), "+m" (var)		\
 			    : : "memory");				\
 		break;							\
 	case 8:								\
-		asm("xaddq %0, "__percpu_arg(1)				\
+		asm qual ("xaddq %0, "__percpu_arg(1)			\
 			    : "+re" (paro_ret__), "+m" (var)		\
 			    : : "memory");				\
 		break;							\
@@ -299,13 +299,13 @@ do {									\
  * expensive due to the implied lock prefix.  The processor cannot prefetch
  * cachelines if xchg is used.
  */
-#define percpu_xchg_op(var, nval)					\
+#define percpu_xchg_op(qual, var, nval)					\
 ({									\
 	typeof(var) pxo_ret__;						\
 	typeof(var) pxo_new__ = (nval);					\
 	switch (sizeof(var)) {						\
 	case 1:								\
-		asm("\n\tmov "__percpu_arg(1)",%%al"			\
+		asm qual ("\n\tmov "__percpu_arg(1)",%%al"		\
 		    "\n1:\tcmpxchgb %2, "__percpu_arg(1)		\
 		    "\n\tjnz 1b"					\
 			    : "=&a" (pxo_ret__), "+m" (var)		\
@@ -313,7 +313,7 @@ do {									\
 			    : "memory");				\
 		break;							\
 	case 2:								\
-		asm("\n\tmov "__percpu_arg(1)",%%ax"			\
+		asm qual ("\n\tmov "__percpu_arg(1)",%%ax"		\
 		    "\n1:\tcmpxchgw %2, "__percpu_arg(1)		\
 		    "\n\tjnz 1b"					\
 			    : "=&a" (pxo_ret__), "+m" (var)		\
@@ -321,7 +321,7 @@ do {									\
 			    : "memory");				\
 		break;							\
 	case 4:								\
-		asm("\n\tmov "__percpu_arg(1)",%%eax"			\
+		asm qual ("\n\tmov "__percpu_arg(1)",%%eax"		\
 		    "\n1:\tcmpxchgl %2, "__percpu_arg(1)		\
 		    "\n\tjnz 1b"					\
 			    : "=&a" (pxo_ret__), "+m" (var)		\
@@ -329,7 +329,7 @@ do {									\
 			    : "memory");				\
 		break;							\
 	case 8:								\
-		asm("\n\tmov "__percpu_arg(1)",%%rax"			\
+		asm qual ("\n\tmov "__percpu_arg(1)",%%rax"		\
 		    "\n1:\tcmpxchgq %2, "__percpu_arg(1)		\
 		    "\n\tjnz 1b"					\
 			    : "=&a" (pxo_ret__), "+m" (var)		\
@@ -345,32 +345,32 @@ do {									\
  * cmpxchg has no such implied lock semantics as a result it is much
  * more efficient for cpu local operations.
  */
-#define percpu_cmpxchg_op(var, oval, nval)				\
+#define percpu_cmpxchg_op(qual, var, oval, nval)			\
 ({									\
 	typeof(var) pco_ret__;						\
 	typeof(var) pco_old__ = (oval);					\
 	typeof(var) pco_new__ = (nval);					\
 	switch (sizeof(var)) {						\
 	case 1:								\
-		asm("cmpxchgb %2, "__percpu_arg(1)			\
+		asm qual ("cmpxchgb %2, "__percpu_arg(1)		\
 			    : "=a" (pco_ret__), "+m" (var)		\
 			    : "q" (pco_new__), "0" (pco_old__)		\
 			    : "memory");				\
 		break;							\
 	case 2:								\
-		asm("cmpxchgw %2, "__percpu_arg(1)			\
+		asm qual ("cmpxchgw %2, "__percpu_arg(1)		\
 			    : "=a" (pco_ret__), "+m" (var)		\
 			    : "r" (pco_new__), "0" (pco_old__)		\
 			    : "memory");				\
 		break;							\
 	case 4:								\
-		asm("cmpxchgl %2, "__percpu_arg(1)			\
+		asm qual ("cmpxchgl %2, "__percpu_arg(1)		\
 			    : "=a" (pco_ret__), "+m" (var)		\
 			    : "r" (pco_new__), "0" (pco_old__)		\
 			    : "memory");				\
 		break;							\
 	case 8:								\
-		asm("cmpxchgq %2, "__percpu_arg(1)			\
+		asm qual ("cmpxchgq %2, "__percpu_arg(1)		\
 			    : "=a" (pco_ret__), "+m" (var)		\
 			    : "r" (pco_new__), "0" (pco_old__)		\
 			    : "memory");				\
@@ -391,58 +391,58 @@ do {									\
  */
 #define this_cpu_read_stable(var)	percpu_stable_op("mov", var)
 
-#define raw_cpu_read_1(pcp)		percpu_from_op("mov", pcp)
-#define raw_cpu_read_2(pcp)		percpu_from_op("mov", pcp)
-#define raw_cpu_read_4(pcp)		percpu_from_op("mov", pcp)
-
-#define raw_cpu_write_1(pcp, val)	percpu_to_op("mov", (pcp), val)
-#define raw_cpu_write_2(pcp, val)	percpu_to_op("mov", (pcp), val)
-#define raw_cpu_write_4(pcp, val)	percpu_to_op("mov", (pcp), val)
-#define raw_cpu_add_1(pcp, val)		percpu_add_op((pcp), val)
-#define raw_cpu_add_2(pcp, val)		percpu_add_op((pcp), val)
-#define raw_cpu_add_4(pcp, val)		percpu_add_op((pcp), val)
-#define raw_cpu_and_1(pcp, val)		percpu_to_op("and", (pcp), val)
-#define raw_cpu_and_2(pcp, val)		percpu_to_op("and", (pcp), val)
-#define raw_cpu_and_4(pcp, val)		percpu_to_op("and", (pcp), val)
-#define raw_cpu_or_1(pcp, val)		percpu_to_op("or", (pcp), val)
-#define raw_cpu_or_2(pcp, val)		percpu_to_op("or", (pcp), val)
-#define raw_cpu_or_4(pcp, val)		percpu_to_op("or", (pcp), val)
-#define raw_cpu_xchg_1(pcp, val)	percpu_xchg_op(pcp, val)
-#define raw_cpu_xchg_2(pcp, val)	percpu_xchg_op(pcp, val)
-#define raw_cpu_xchg_4(pcp, val)	percpu_xchg_op(pcp, val)
-
-#define this_cpu_read_1(pcp)		percpu_from_op("mov", pcp)
-#define this_cpu_read_2(pcp)		percpu_from_op("mov", pcp)
-#define this_cpu_read_4(pcp)		percpu_from_op("mov", pcp)
-#define this_cpu_write_1(pcp, val)	percpu_to_op("mov", (pcp), val)
-#define this_cpu_write_2(pcp, val)	percpu_to_op("mov", (pcp), val)
-#define this_cpu_write_4(pcp, val)	percpu_to_op("mov", (pcp), val)
-#define this_cpu_add_1(pcp, val)	percpu_add_op((pcp), val)
-#define this_cpu_add_2(pcp, val)	percpu_add_op((pcp), val)
-#define this_cpu_add_4(pcp, val)	percpu_add_op((pcp), val)
-#define this_cpu_and_1(pcp, val)	percpu_to_op("and", (pcp), val)
-#define this_cpu_and_2(pcp, val)	percpu_to_op("and", (pcp), val)
-#define this_cpu_and_4(pcp, val)	percpu_to_op("and", (pcp), val)
-#define this_cpu_or_1(pcp, val)		percpu_to_op("or", (pcp), val)
-#define this_cpu_or_2(pcp, val)		percpu_to_op("or", (pcp), val)
-#define this_cpu_or_4(pcp, val)		percpu_to_op("or", (pcp), val)
-#define this_cpu_xchg_1(pcp, nval)	percpu_xchg_op(pcp, nval)
-#define this_cpu_xchg_2(pcp, nval)	percpu_xchg_op(pcp, nval)
-#define this_cpu_xchg_4(pcp, nval)	percpu_xchg_op(pcp, nval)
-
-#define raw_cpu_add_return_1(pcp, val)		percpu_add_return_op(pcp, val)
-#define raw_cpu_add_return_2(pcp, val)		percpu_add_return_op(pcp, val)
-#define raw_cpu_add_return_4(pcp, val)		percpu_add_return_op(pcp, val)
-#define raw_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
-#define raw_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
-#define raw_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
-
-#define this_cpu_add_return_1(pcp, val)		percpu_add_return_op(pcp, val)
-#define this_cpu_add_return_2(pcp, val)		percpu_add_return_op(pcp, val)
-#define this_cpu_add_return_4(pcp, val)		percpu_add_return_op(pcp, val)
-#define this_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
-#define this_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
-#define this_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
+#define raw_cpu_read_1(pcp)		percpu_from_op(, "mov", pcp)
+#define raw_cpu_read_2(pcp)		percpu_from_op(, "mov", pcp)
+#define raw_cpu_read_4(pcp)		percpu_from_op(, "mov", pcp)
+
+#define raw_cpu_write_1(pcp, val)	percpu_to_op(, "mov", (pcp), val)
+#define raw_cpu_write_2(pcp, val)	percpu_to_op(, "mov", (pcp), val)
+#define raw_cpu_write_4(pcp, val)	percpu_to_op(, "mov", (pcp), val)
+#define raw_cpu_add_1(pcp, val)		percpu_add_op(, (pcp), val)
+#define raw_cpu_add_2(pcp, val)		percpu_add_op(, (pcp), val)
+#define raw_cpu_add_4(pcp, val)		percpu_add_op(, (pcp), val)
+#define raw_cpu_and_1(pcp, val)		percpu_to_op(, "and", (pcp), val)
+#define raw_cpu_and_2(pcp, val)		percpu_to_op(, "and", (pcp), val)
+#define raw_cpu_and_4(pcp, val)		percpu_to_op(, "and", (pcp), val)
+#define raw_cpu_or_1(pcp, val)		percpu_to_op(, "or", (pcp), val)
+#define raw_cpu_or_2(pcp, val)		percpu_to_op(, "or", (pcp), val)
+#define raw_cpu_or_4(pcp, val)		percpu_to_op(, "or", (pcp), val)
+#define raw_cpu_xchg_1(pcp, val)	percpu_xchg_op(, pcp, val)
+#define raw_cpu_xchg_2(pcp, val)	percpu_xchg_op(, pcp, val)
+#define raw_cpu_xchg_4(pcp, val)	percpu_xchg_op(, pcp, val)
+
+#define this_cpu_read_1(pcp)		percpu_from_op(volatile, "mov", pcp)
+#define this_cpu_read_2(pcp)		percpu_from_op(volatile, "mov", pcp)
+#define this_cpu_read_4(pcp)		percpu_from_op(volatile, "mov", pcp)
+#define this_cpu_write_1(pcp, val)	percpu_to_op(volatile, "mov", (pcp), val)
+#define this_cpu_write_2(pcp, val)	percpu_to_op(volatile, "mov", (pcp), val)
+#define this_cpu_write_4(pcp, val)	percpu_to_op(volatile, "mov", (pcp), val)
+#define this_cpu_add_1(pcp, val)	percpu_add_op(volatile, (pcp), val)
+#define this_cpu_add_2(pcp, val)	percpu_add_op(volatile, (pcp), val)
+#define this_cpu_add_4(pcp, val)	percpu_add_op(volatile, (pcp), val)
+#define this_cpu_and_1(pcp, val)	percpu_to_op(volatile, "and", (pcp), val)
+#define this_cpu_and_2(pcp, val)	percpu_to_op(volatile, "and", (pcp), val)
+#define this_cpu_and_4(pcp, val)	percpu_to_op(volatile, "and", (pcp), val)
+#define this_cpu_or_1(pcp, val)		percpu_to_op(volatile, "or", (pcp), val)
+#define this_cpu_or_2(pcp, val)		percpu_to_op(volatile, "or", (pcp), val)
+#define this_cpu_or_4(pcp, val)		percpu_to_op(volatile, "or", (pcp), val)
+#define this_cpu_xchg_1(pcp, nval)	percpu_xchg_op(volatile, pcp, nval)
+#define this_cpu_xchg_2(pcp, nval)	percpu_xchg_op(volatile, pcp, nval)
+#define this_cpu_xchg_4(pcp, nval)	percpu_xchg_op(volatile, pcp, nval)
+
+#define raw_cpu_add_return_1(pcp, val)		percpu_add_return_op(, pcp, val)
+#define raw_cpu_add_return_2(pcp, val)		percpu_add_return_op(, pcp, val)
+#define raw_cpu_add_return_4(pcp, val)		percpu_add_return_op(, pcp, val)
+#define raw_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(, pcp, oval, nval)
+#define raw_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(, pcp, oval, nval)
+#define raw_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(, pcp, oval, nval)
+
+#define this_cpu_add_return_1(pcp, val)		percpu_add_return_op(volatile, pcp, val)
+#define this_cpu_add_return_2(pcp, val)		percpu_add_return_op(volatile, pcp, val)
+#define this_cpu_add_return_4(pcp, val)		percpu_add_return_op(volatile, pcp, val)
+#define this_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(volatile, pcp, oval, nval)
+#define this_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(volatile, pcp, oval, nval)
+#define this_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(volatile, pcp, oval, nval)
 
 #ifdef CONFIG_X86_CMPXCHG64
 #define percpu_cmpxchg8b_double(pcp1, pcp2, o1, o2, n1, n2)		\
@@ -466,23 +466,23 @@ do {									\
  * 32 bit must fall back to generic operations.
  */
 #ifdef CONFIG_X86_64
-#define raw_cpu_read_8(pcp)			percpu_from_op("mov", pcp)
-#define raw_cpu_write_8(pcp, val)		percpu_to_op("mov", (pcp), val)
-#define raw_cpu_add_8(pcp, val)			percpu_add_op((pcp), val)
-#define raw_cpu_and_8(pcp, val)			percpu_to_op("and", (pcp), val)
-#define raw_cpu_or_8(pcp, val)			percpu_to_op("or", (pcp), val)
-#define raw_cpu_add_return_8(pcp, val)		percpu_add_return_op(pcp, val)
-#define raw_cpu_xchg_8(pcp, nval)		percpu_xchg_op(pcp, nval)
-#define raw_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
-
-#define this_cpu_read_8(pcp)			percpu_from_op("mov", pcp)
-#define this_cpu_write_8(pcp, val)		percpu_to_op("mov", (pcp), val)
-#define this_cpu_add_8(pcp, val)		percpu_add_op((pcp), val)
-#define this_cpu_and_8(pcp, val)		percpu_to_op("and", (pcp), val)
-#define this_cpu_or_8(pcp, val)			percpu_to_op("or", (pcp), val)
-#define this_cpu_add_return_8(pcp, val)		percpu_add_return_op(pcp, val)
-#define this_cpu_xchg_8(pcp, nval)		percpu_xchg_op(pcp, nval)
-#define this_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
+#define raw_cpu_read_8(pcp)			percpu_from_op(, "mov", pcp)
+#define raw_cpu_write_8(pcp, val)		percpu_to_op(, "mov", (pcp), val)
+#define raw_cpu_add_8(pcp, val)			percpu_add_op(, (pcp), val)
+#define raw_cpu_and_8(pcp, val)			percpu_to_op(, "and", (pcp), val)
+#define raw_cpu_or_8(pcp, val)			percpu_to_op(, "or", (pcp), val)
+#define raw_cpu_add_return_8(pcp, val)		percpu_add_return_op(, pcp, val)
+#define raw_cpu_xchg_8(pcp, nval)		percpu_xchg_op(, pcp, nval)
+#define raw_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(, pcp, oval, nval)
+
+#define this_cpu_read_8(pcp)			percpu_from_op(volatile, "mov", pcp)
+#define this_cpu_write_8(pcp, val)		percpu_to_op(volatile, "mov", (pcp), val)
+#define this_cpu_add_8(pcp, val)		percpu_add_op(volatile, (pcp), val)
+#define this_cpu_and_8(pcp, val)		percpu_to_op(volatile, "and", (pcp), val)
+#define this_cpu_or_8(pcp, val)			percpu_to_op(volatile, "or", (pcp), val)
+#define this_cpu_add_return_8(pcp, val)		percpu_add_return_op(volatile, pcp, val)
+#define this_cpu_xchg_8(pcp, nval)		percpu_xchg_op(volatile, pcp, nval)
+#define this_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(volatile, pcp, oval, nval)
 
 /*
  * Pretty complex macro to generate cmpxchg16 instruction.  The instruction

Nadav Amit Dec. 6, 2018, 5:26 p.m. UTC | #9

> On Dec 6, 2018, at 2:25 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Dec 06, 2018 at 12:28:26AM -0800, Nadav Amit wrote:
>> [ +Peter ]
>> 
>> So I dug some more (I’m still not done), and found various trivial things
>> (e.g., storing zero extending u32 immediate is shorter for registers,
>> inlining already takes place).
>> 
>> *But* there is one thing that may require some attention - patch
>> b59167ac7bafd ("x86/percpu: Fix this_cpu_read()”) set ordering constraints
>> on the VM_ARGS() evaluation. And this patch also imposes, it appears,
>> (unnecessary) constraints on other pieces of code.
>> 
>> These constraints are due to the addition of the volatile keyword for
>> this_cpu_read() by the patch. This affects at least 68 functions in my
>> kernel build, some of which are hot (I think), e.g., finish_task_switch(),
>> smp_x86_platform_ipi() and select_idle_sibling().
>> 
>> Peter, perhaps the solution was too big of a hammer? Is it possible instead
>> to create a separate "this_cpu_read_once()” with the volatile keyword? Such
>> a function can be used for native_sched_clock() and other seqlocks, etc.
> 
> No. like the commit writes this_cpu_read() _must_ imply READ_ONCE(). If
> you want something else, use something else, there's plenty other
> options available.
> 
> There's this_cpu_op_stable(), but also __this_cpu_read() and
> raw_this_cpu_read() (which currently don't differ from this_cpu_read()
> but could).

Would setting the inline assembly memory operand both as input and output be
better than using the “volatile”?

I think that If you do that, the compiler would should the this_cpu_read()
as something that changes the per-cpu-variable, which would make it invalid
to re-read the value. At the same time, it would not prevent reordering the
read with other stuff.

Peter Zijlstra Dec. 7, 2018, 8:45 a.m. UTC | #10

On Thu, Dec 06, 2018 at 09:26:24AM -0800, Nadav Amit wrote:
> > On Dec 6, 2018, at 2:25 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > On Thu, Dec 06, 2018 at 12:28:26AM -0800, Nadav Amit wrote:
> >> [ +Peter ]
> >> 
> >> So I dug some more (I’m still not done), and found various trivial things
> >> (e.g., storing zero extending u32 immediate is shorter for registers,
> >> inlining already takes place).
> >> 
> >> *But* there is one thing that may require some attention - patch
> >> b59167ac7bafd ("x86/percpu: Fix this_cpu_read()”) set ordering constraints
> >> on the VM_ARGS() evaluation. And this patch also imposes, it appears,
> >> (unnecessary) constraints on other pieces of code.
> >> 
> >> These constraints are due to the addition of the volatile keyword for
> >> this_cpu_read() by the patch. This affects at least 68 functions in my
> >> kernel build, some of which are hot (I think), e.g., finish_task_switch(),
> >> smp_x86_platform_ipi() and select_idle_sibling().
> >> 
> >> Peter, perhaps the solution was too big of a hammer? Is it possible instead
> >> to create a separate "this_cpu_read_once()” with the volatile keyword? Such
> >> a function can be used for native_sched_clock() and other seqlocks, etc.
> > 
> > No. like the commit writes this_cpu_read() _must_ imply READ_ONCE(). If
> > you want something else, use something else, there's plenty other
> > options available.
> > 
> > There's this_cpu_op_stable(), but also __this_cpu_read() and
> > raw_this_cpu_read() (which currently don't differ from this_cpu_read()
> > but could).
> 
> Would setting the inline assembly memory operand both as input and output be
> better than using the “volatile”?

I don't know.. I'm forever befuddled by the exact semantics of gcc
inline asm.

> I think that If you do that, the compiler would should the this_cpu_read()
> as something that changes the per-cpu-variable, which would make it invalid
> to re-read the value. At the same time, it would not prevent reordering the
> read with other stuff.

So the thing is; as I wrote, the generic version of this_cpu_*() is:

	local_irq_save();
	__this_cpu_*();
	local_irq_restore();

And per local_irq_{save,restore}() including compiler barriers that
cannot be reordered around either.

And per the principle of least surprise, I think our primitives should
have similar semantics.


I'm actually having difficulty finding the this_cpu_read() in any of the
functions you mention, so I cannot make any concrete suggestions other
than pointing at the alternative functions available.

Nadav Amit Dec. 7, 2018, 11:12 p.m. UTC | #11

[ We can start a new thread, since I have the tendency to hijack threads. ]

> On Dec 7, 2018, at 12:45 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Dec 06, 2018 at 09:26:24AM -0800, Nadav Amit wrote:
>>> On Dec 6, 2018, at 2:25 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>> On Thu, Dec 06, 2018 at 12:28:26AM -0800, Nadav Amit wrote:
>>>> [ +Peter ]
>>>> 
>>>> So I dug some more (I’m still not done), and found various trivial things
>>>> (e.g., storing zero extending u32 immediate is shorter for registers,
>>>> inlining already takes place).
>>>> 
>>>> *But* there is one thing that may require some attention - patch
>>>> b59167ac7bafd ("x86/percpu: Fix this_cpu_read()”) set ordering constraints
>>>> on the VM_ARGS() evaluation. And this patch also imposes, it appears,
>>>> (unnecessary) constraints on other pieces of code.
>>>> 
>>>> These constraints are due to the addition of the volatile keyword for
>>>> this_cpu_read() by the patch. This affects at least 68 functions in my
>>>> kernel build, some of which are hot (I think), e.g., finish_task_switch(),
>>>> smp_x86_platform_ipi() and select_idle_sibling().
>>>> 
>>>> Peter, perhaps the solution was too big of a hammer? Is it possible instead
>>>> to create a separate "this_cpu_read_once()” with the volatile keyword? Such
>>>> a function can be used for native_sched_clock() and other seqlocks, etc.
>>> 
>>> No. like the commit writes this_cpu_read() _must_ imply READ_ONCE(). If
>>> you want something else, use something else, there's plenty other
>>> options available.
>>> 
>>> There's this_cpu_op_stable(), but also __this_cpu_read() and
>>> raw_this_cpu_read() (which currently don't differ from this_cpu_read()
>>> but could).
>> 
>> Would setting the inline assembly memory operand both as input and output be
>> better than using the “volatile”?
> 
> I don't know.. I'm forever befuddled by the exact semantics of gcc
> inline asm.
> 
>> I think that If you do that, the compiler would should the this_cpu_read()
>> as something that changes the per-cpu-variable, which would make it invalid
>> to re-read the value. At the same time, it would not prevent reordering the
>> read with other stuff.
> 
> So the thing is; as I wrote, the generic version of this_cpu_*() is:
> 
> 	local_irq_save();
> 	__this_cpu_*();
> 	local_irq_restore();
> 
> And per local_irq_{save,restore}() including compiler barriers that
> cannot be reordered around either.
> 
> And per the principle of least surprise, I think our primitives should
> have similar semantics.

I guess so, but as you’ll see below, the end result is ugly.

> I'm actually having difficulty finding the this_cpu_read() in any of the
> functions you mention, so I cannot make any concrete suggestions other
> than pointing at the alternative functions available.


So I got deeper into the code to understand a couple of differences. In the
case of select_idle_sibling(), the patch (Peter’s) increase the function
code size by 123 bytes (over the baseline of 986). The per-cpu variable is
called through the following call chain:

	select_idle_sibling()
	=> select_idle_cpu()
	=> local_clock()
	=> raw_smp_processor_id()

And results in 2 more calls to sched_clock_cpu(), as the compiler assumes
the processor id changes in between (which obviously wouldn’t happen). There
may be more changes around, which I didn’t fully analyze. But the very least
reading the processor id should not get “volatile”.

As for finish_task_switch(), the impact is only few bytes, but still
unnecessary. It appears that with your patch preempt_count() causes multiple
reads of __preempt_count in this code:

        if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
                      "corrupted preempt_count: %s/%d/0x%x\n",
                      current->comm, current->pid, preempt_count()))
                preempt_count_set(FORK_PREEMPT_COUNT);

Again, this is unwarranted, as the preemption count should not be changed in
any interrupt.

Nadav Amit Dec. 8, 2018, 12:40 a.m. UTC | #12

[Resend, changing title & adding lkml and some others ]

On Dec 7, 2018, at 3:12 PM, Nadav Amit <nadav.amit@gmail.com> wrote:

[ We can start a new thread, since I have the tendency to hijack threads. ]

> On Dec 7, 2018, at 12:45 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Dec 06, 2018 at 09:26:24AM -0800, Nadav Amit wrote:
>>> On Dec 6, 2018, at 2:25 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>> On Thu, Dec 06, 2018 at 12:28:26AM -0800, Nadav Amit wrote:
>>>> [ +Peter ]
>>>> 

[snip]

>>>> 
>>>> *But* there is one thing that may require some attention - patch
>>>> b59167ac7bafd ("x86/percpu: Fix this_cpu_read()”) set ordering constraints
>>>> on the VM_ARGS() evaluation. And this patch also imposes, it appears,
>>>> (unnecessary) constraints on other pieces of code.
>>>> 
>>>> These constraints are due to the addition of the volatile keyword for
>>>> this_cpu_read() by the patch. This affects at least 68 functions in my
>>>> kernel build, some of which are hot (I think), e.g., finish_task_switch(),
>>>> smp_x86_platform_ipi() and select_idle_sibling().
>>>> 
>>>> Peter, perhaps the solution was too big of a hammer? Is it possible instead
>>>> to create a separate "this_cpu_read_once()” with the volatile keyword? Such
>>>> a function can be used for native_sched_clock() and other seqlocks, etc.
>>> 
>>> No. like the commit writes this_cpu_read() _must_ imply READ_ONCE(). If
>>> you want something else, use something else, there's plenty other
>>> options available.
>>> 
>>> There's this_cpu_op_stable(), but also __this_cpu_read() and
>>> raw_this_cpu_read() (which currently don't differ from this_cpu_read()
>>> but could).
>> 
>> Would setting the inline assembly memory operand both as input and output be
>> better than using the “volatile”?
> 
> I don't know.. I'm forever befuddled by the exact semantics of gcc
> inline asm.
> 
>> I think that If you do that, the compiler would should the this_cpu_read()
>> as something that changes the per-cpu-variable, which would make it invalid
>> to re-read the value. At the same time, it would not prevent reordering the
>> read with other stuff.
> 
> So the thing is; as I wrote, the generic version of this_cpu_*() is:
> 
> 	local_irq_save();
> 	__this_cpu_*();
> 	local_irq_restore();
> 
> And per local_irq_{save,restore}() including compiler barriers that
> cannot be reordered around either.
> 
> And per the principle of least surprise, I think our primitives should
> have similar semantics.

I guess so, but as you’ll see below, the end result is ugly.

> I'm actually having difficulty finding the this_cpu_read() in any of the
> functions you mention, so I cannot make any concrete suggestions other
> than pointing at the alternative functions available.


So I got deeper into the code to understand a couple of differences. In the
case of select_idle_sibling(), the patch (Peter’s) increase the function
code size by 123 bytes (over the baseline of 986). The per-cpu variable is
called through the following call chain:

	select_idle_sibling()
	=> select_idle_cpu()
	=> local_clock()
	=> raw_smp_processor_id()

And results in 2 more calls to sched_clock_cpu(), as the compiler assumes
the processor id changes in between (which obviously wouldn’t happen). There
may be more changes around, which I didn’t fully analyze. But the very least
reading the processor id should not get “volatile”.

As for finish_task_switch(), the impact is only few bytes, but still
unnecessary. It appears that with your patch preempt_count() causes multiple
reads of __preempt_count in this code:

       if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
                     "corrupted preempt_count: %s/%d/0x%x\n",
                     current->comm, current->pid, preempt_count()))
               preempt_count_set(FORK_PREEMPT_COUNT);

Again, this is unwarranted, as the preemption count should not be changed in
any interrupt.

Peter Zijlstra Dec. 8, 2018, 10:52 a.m. UTC | #13

On Fri, Dec 07, 2018 at 04:40:52PM -0800, Nadav Amit wrote:

> > I'm actually having difficulty finding the this_cpu_read() in any of the
> > functions you mention, so I cannot make any concrete suggestions other
> > than pointing at the alternative functions available.
> 
> 
> So I got deeper into the code to understand a couple of differences. In the
> case of select_idle_sibling(), the patch (Peter’s) increase the function
> code size by 123 bytes (over the baseline of 986). The per-cpu variable is
> called through the following call chain:
> 
> 	select_idle_sibling()
> 	=> select_idle_cpu()
> 	=> local_clock()
> 	=> raw_smp_processor_id()
> 
> And results in 2 more calls to sched_clock_cpu(), as the compiler assumes
> the processor id changes in between (which obviously wouldn’t happen).

That is the thing with raw_smp_processor_id(), it is allowed to be used
in preemptible context, and there it _obviously_ can change between
subsequent invocations.

So again, this change is actually good.

If we want to fix select_idle_cpu(), we should maybe not use
local_clock() there but use sched_clock_cpu() with a stable argument,
this code runs with IRQs disabled and therefore the CPU number is stable
for us here.

> There may be more changes around, which I didn’t fully analyze. But
> the very least reading the processor id should not get “volatile”.
> 
> As for finish_task_switch(), the impact is only few bytes, but still
> unnecessary. It appears that with your patch preempt_count() causes multiple
> reads of __preempt_count in this code:
> 
>        if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
>                      "corrupted preempt_count: %s/%d/0x%x\n",
>                      current->comm, current->pid, preempt_count()))
>                preempt_count_set(FORK_PREEMPT_COUNT);

My patch proposed here:

  https://marc.info/?l=linux-mm&m=154409548410209

would actually fix that one I think, preempt_count() uses
raw_cpu_read_4() which will loose the volatile with that patch.

Nadav Amit Dec. 10, 2018, 12:57 a.m. UTC | #14

> On Dec 8, 2018, at 2:52 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Fri, Dec 07, 2018 at 04:40:52PM -0800, Nadav Amit wrote:
> 
>>> I'm actually having difficulty finding the this_cpu_read() in any of the
>>> functions you mention, so I cannot make any concrete suggestions other
>>> than pointing at the alternative functions available.
>> 
>> 
>> So I got deeper into the code to understand a couple of differences. In the
>> case of select_idle_sibling(), the patch (Peter’s) increase the function
>> code size by 123 bytes (over the baseline of 986). The per-cpu variable is
>> called through the following call chain:
>> 
>> 	select_idle_sibling()
>> 	=> select_idle_cpu()
>> 	=> local_clock()
>> 	=> raw_smp_processor_id()
>> 
>> And results in 2 more calls to sched_clock_cpu(), as the compiler assumes
>> the processor id changes in between (which obviously wouldn’t happen).
> 
> That is the thing with raw_smp_processor_id(), it is allowed to be used
> in preemptible context, and there it _obviously_ can change between
> subsequent invocations.
> 
> So again, this change is actually good.
> 
> If we want to fix select_idle_cpu(), we should maybe not use
> local_clock() there but use sched_clock_cpu() with a stable argument,
> this code runs with IRQs disabled and therefore the CPU number is stable
> for us here.
> 
>> There may be more changes around, which I didn’t fully analyze. But
>> the very least reading the processor id should not get “volatile”.
>> 
>> As for finish_task_switch(), the impact is only few bytes, but still
>> unnecessary. It appears that with your patch preempt_count() causes multiple
>> reads of __preempt_count in this code:
>> 
>>       if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
>>                     "corrupted preempt_count: %s/%d/0x%x\n",
>>                     current->comm, current->pid, preempt_count()))
>>               preempt_count_set(FORK_PREEMPT_COUNT);
> 
> My patch proposed here:
> 
>  https://marc.info/?l=linux-mm&m=154409548410209
> 
> would actually fix that one I think, preempt_count() uses
> raw_cpu_read_4() which will loose the volatile with that patch.

Sorry for the spam from yesterday. That what happens when I try to write
emails on my phone while I’m distracted.

I tested the patch you referenced, and it certainly improves the situation
for reads, but there are still small and big issues lying around.

The biggest one is that (I think) smp_processor_id() should apparently use
__this_cpu_read(). Anyhow, when preemption checks are on, it is validated
that smp_processor_id() is not used while preemption is enabled, and IRQs
are not supposed to change its value. Otherwise the generated code of many
functions is affected.

There are all kind of other smaller issues, such as set_irq_regs() and
get_irq_regs(), which should run with disabled interrupts. They affect the
generated code in do_IRQ() and others.

But beyond that, there are so many places in the code that use
this_cpu_read() while IRQs are guaranteed to be disabled. For example
arch/x86/mm/tlb.c is full with this_cpu_read/write() and almost(?) all
should be running with interrupts disabled. Having said that, in my build
only flush_tlb_func_common() was affected.

Peter Zijlstra Dec. 10, 2018, 8:55 a.m. UTC | #15

On Sun, Dec 09, 2018 at 04:57:43PM -0800, Nadav Amit wrote:
> > On Dec 8, 2018, at 2:52 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> > My patch proposed here:
> > 
> >  https://marc.info/?l=linux-mm&m=154409548410209
> > 
> > would actually fix that one I think, preempt_count() uses
> > raw_cpu_read_4() which will loose the volatile with that patch.

> I tested the patch you referenced, and it certainly improves the situation
> for reads, but there are still small and big issues lying around.

I'm sure :-(, this has been 'festering' for a long while it seems. And
esp. on x86 specific code, where for a long time we all assumed the
various per-cpu APIs were in fact the same (which turns out to very much
not be true).

> The biggest one is that (I think) smp_processor_id() should apparently use
> __this_cpu_read().

Agreed, and note that this will also improve code generation on !x86.

However, I'm not sure the current !debug definition:

#define smp_processor_id() raw_smp_processor_id()

is actually correct. Where raw_smp_processor_id() must be
this_cpu_read() to avoid CSE, we actually want to allow CSE on
smp_processor_id() etc..

> There are all kind of other smaller issues, such as set_irq_regs() and
> get_irq_regs(), which should run with disabled interrupts. They affect the
> generated code in do_IRQ() and others.
> 
> But beyond that, there are so many places in the code that use
> this_cpu_read() while IRQs are guaranteed to be disabled. For example
> arch/x86/mm/tlb.c is full with this_cpu_read/write() and almost(?) all
> should be running with interrupts disabled. Having said that, in my build
> only flush_tlb_func_common() was affected.

This all feels like something static analysis could help with; such
tools would also make sense for !x86 where the difference between the
various per-cpu accessors is even bigger.

Nadav Amit Dec. 11, 2018, 5:11 p.m. UTC | #16

> On Dec 10, 2018, at 12:55 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Sun, Dec 09, 2018 at 04:57:43PM -0800, Nadav Amit wrote:
>>> On Dec 8, 2018, at 2:52 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>>> My patch proposed here:
>>> 
>>> https://marc.info/?l=linux-mm&m=154409548410209
>>> 
>>> would actually fix that one I think, preempt_count() uses
>>> raw_cpu_read_4() which will loose the volatile with that patch.
> 
>> I tested the patch you referenced, and it certainly improves the situation
>> for reads, but there are still small and big issues lying around.
> 
> I'm sure :-(, this has been 'festering' for a long while it seems. And
> esp. on x86 specific code, where for a long time we all assumed the
> various per-cpu APIs were in fact the same (which turns out to very much
> not be true).
> 
>> The biggest one is that (I think) smp_processor_id() should apparently use
>> __this_cpu_read().
> 
> Agreed, and note that this will also improve code generation on !x86.
> 
> However, I'm not sure the current !debug definition:
> 
> #define smp_processor_id() raw_smp_processor_id()
> 
> is actually correct. Where raw_smp_processor_id() must be
> this_cpu_read() to avoid CSE, we actually want to allow CSE on
> smp_processor_id() etc..

Yes. That makes sense.

> 
>> There are all kind of other smaller issues, such as set_irq_regs() and
>> get_irq_regs(), which should run with disabled interrupts. They affect the
>> generated code in do_IRQ() and others.
>> 
>> But beyond that, there are so many places in the code that use
>> this_cpu_read() while IRQs are guaranteed to be disabled. For example
>> arch/x86/mm/tlb.c is full with this_cpu_read/write() and almost(?) all
>> should be running with interrupts disabled. Having said that, in my build
>> only flush_tlb_func_common() was affected.
> 
> This all feels like something static analysis could help with; such
> tools would also make sense for !x86 where the difference between the
> various per-cpu accessors is even bigger.

If something like that existed, it could also allow to get rid of
local_irq_save() (and use local_irq_disable() instead).

Number of arguments in vmalloc.c

Commit Message

Comments

Patch