x86-64: Maintain 16-byte stack alignment

Message ID 20170110143340.GA3787@gondor.apana.org.au
State Not Applicable
Delegated to: Herbert Xu

Commit Message

Herbert Xu Jan. 10, 2017, 2:33 p.m. UTC
I recently applied the patch

	https://patchwork.kernel.org/patch/9468391/

and ended up with a boot crash when it tried to run the x86 chacha20
code.  It turned out that the patch changed a manually aligned
stack buffer to one that is aligned by gcc.  What was happening was
that gcc can stack align to any value on x86-64 except 16.  The
reason is that gcc assumes that the stack is always 16-byte aligned,
which is not actually the case in the kernel.

The x86-64 CPU actually tries to keep the stack 16-byte aligned,
e.g., it'll do so when an IRQ comes in.  So the reason it doesn't
work in the kernel mostly comes down to the fact that the struct
pt_regs which lives near the top of the stack is 168 bytes which
is not a multiple of 16.

This patch tries to fix this by adding an 8-byte padding at the
top of the call-chain involving pt_regs so that when we call a C
function later we do so with an aligned stack.

The same problem probably exists on i386 too since gcc also assumes
16-byte alignment there.  It's harder to fix however as the CPU
doesn't help us in the IRQ case.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Comments

Herbert Xu Jan. 10, 2017, 2:39 p.m. UTC | #1
On Tue, Jan 10, 2017 at 10:33:40PM +0800, Herbert Xu wrote:
> I recently applied the patch
> 
> 	https://patchwork.kernel.org/patch/9468391/
> 
> and ended up with a boot crash when it tried to run the x86 chacha20
> code.  It turned out that the patch changed a manually aligned
> stack buffer to one that is aligned by gcc.  What was happening was
> that gcc can stack align to any value on x86-64 except 16.  The
> reason is that gcc assumes that the stack is always 16-byte aligned,
> which is not actually the case in the kernel.

BTW this is with Debian gcc 4.7.2 which does not allow an 8-byte
stack alignment as attempted by the Makefile:

$ gcc -S -O2 -mno-sse -mpreferred-stack-boundary=3 a.c
a.c:1:0: error: -mpreferred-stack-boundary=3 is not between 4 and 12
$ 

Obviously this is not an issue if your compiler actually allows
the 8-byte alignment.

Cheers,
Linus Torvalds Jan. 10, 2017, 5:05 p.m. UTC | #2
On Tue, Jan 10, 2017 at 6:39 AM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> BTW this is with Debian gcc 4.7.2 which does not allow an 8-byte
> stack alignment as attempted by the Makefile:

I'm pretty sure we have random asm code that may not maintain a
16-byte stack alignment when it calls other code (including, in some
cases, calling C code).

So I'm not at all convinced that this is a good idea. We shouldn't
expect 16-byte alignment to be something trustworthy.

                 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andy Lutomirski Jan. 10, 2017, 5:09 p.m. UTC | #3
On Tue, Jan 10, 2017 at 9:05 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Jan 10, 2017 at 6:39 AM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>>
>> BTW this is with Debian gcc 4.7.2 which does not allow an 8-byte
>> stack alignment as attempted by the Makefile:
>
> I'm pretty sure we have random asm code that may not maintain a
> 16-byte stack alignment when it calls other code (including, in some
> cases, calling C code).

I suspect so.

If we change this, changing pt_regs might make sense but is kind of
weird.  It also needs to be tested with and without frame pointers.

>
> So I'm not at all convinced that this is a good idea. We shouldn't
> expect 16-byte alignment to be something trustworthy.
>
>                  Linus
Ard Biesheuvel Jan. 10, 2017, 5:30 p.m. UTC | #4
On 10 January 2017 at 14:33, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> I recently applied the patch
>
>         https://patchwork.kernel.org/patch/9468391/
>
> and ended up with a boot crash when it tried to run the x86 chacha20
> code.  It turned out that the patch changed a manually aligned
> stack buffer to one that is aligned by gcc.  What was happening was
> that gcc can stack align to any value on x86-64 except 16.  The
> reason is that gcc assumes that the stack is always 16-byte aligned,
> which is not actually the case in the kernel.
>

Apologies for introducing this breakage. It seemed like an obvious and
simple cleanup, so I didn't even bother to mention it in the commit
log, but if the kernel does not guarantee 16-byte alignment, I guess
we should revert to the old method. If SSE instructions are the only
ones that require this alignment, then I suppose not having an
ABI-conforming stack pointer should not be an issue in general.

> The x86-64 CPU actually tries to keep the stack 16-byte aligned,
> e.g., it'll do so when an IRQ comes in.  So the reason it doesn't
> work in the kernel mostly comes down to the fact that the struct
> pt_regs which lives near the top of the stack is 168 bytes which
> is not a multiple of 16.
>
> This patch tries to fix this by adding an 8-byte padding at the
> top of the call-chain involving pt_regs so that when we call a C
> function later we do so with an aligned stack.
>
> The same problem probably exists on i386 too since gcc also assumes
> 16-byte alignment there.  It's harder to fix however as the CPU
> doesn't help us in the IRQ case.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
>
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index 05ed3d3..29d3bcb 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -59,39 +59,42 @@
>  /*
>   * C ABI says these regs are callee-preserved. They aren't saved on kernel entry
>   * unless syscall needs a complete, fully filled "struct pt_regs".
> + *
> + * Note we add 8 extra bytes at the beginning to preserve stack alignment.
>   */
> -#define R15            0*8
> -#define R14            1*8
> -#define R13            2*8
> -#define R12            3*8
> -#define RBP            4*8
> -#define RBX            5*8
> +#define R15            1*8
> +#define R14            2*8
> +#define R13            3*8
> +#define R12            4*8
> +#define RBP            5*8
> +#define RBX            6*8
>  /* These regs are callee-clobbered. Always saved on kernel entry. */
> -#define R11            6*8
> -#define R10            7*8
> -#define R9             8*8
> -#define R8             9*8
> -#define RAX            10*8
> -#define RCX            11*8
> -#define RDX            12*8
> -#define RSI            13*8
> -#define RDI            14*8
> +#define R11            7*8
> +#define R10            8*8
> +#define R9             9*8
> +#define R8             10*8
> +#define RAX            11*8
> +#define RCX            12*8
> +#define RDX            13*8
> +#define RSI            14*8
> +#define RDI            15*8
>  /*
>   * On syscall entry, this is syscall#. On CPU exception, this is error code.
>   * On hw interrupt, it's IRQ number:
>   */
> -#define ORIG_RAX       15*8
> +#define ORIG_RAX       16*8
>  /* Return frame for iretq */
> -#define RIP            16*8
> -#define CS             17*8
> -#define EFLAGS         18*8
> -#define RSP            19*8
> -#define SS             20*8
> +#define RIP            17*8
> +#define CS             18*8
> +#define EFLAGS         19*8
> +#define RSP            20*8
> +#define SS             21*8
>
> +/* Note that this excludes the 8-byte padding. */
>  #define SIZEOF_PTREGS  21*8
>
>         .macro ALLOC_PT_GPREGS_ON_STACK
> -       addq    $-(15*8), %rsp
> +       addq    $-(16*8), %rsp
>         .endm
>
>         .macro SAVE_C_REGS_HELPER offset=0 rax=1 rcx=1 r8910=1 r11=1
> @@ -114,7 +117,7 @@
>         movq %rdi, 14*8+\offset(%rsp)
>         .endm
>         .macro SAVE_C_REGS offset=0
> -       SAVE_C_REGS_HELPER \offset, 1, 1, 1, 1
> +       SAVE_C_REGS_HELPER 8+\offset, 1, 1, 1, 1
>         .endm
>         .macro SAVE_C_REGS_EXCEPT_RAX_RCX offset=0
>         SAVE_C_REGS_HELPER \offset, 0, 0, 1, 1
> @@ -130,43 +133,43 @@
>         .endm
>
>         .macro SAVE_EXTRA_REGS offset=0
> -       movq %r15, 0*8+\offset(%rsp)
> -       movq %r14, 1*8+\offset(%rsp)
> -       movq %r13, 2*8+\offset(%rsp)
> -       movq %r12, 3*8+\offset(%rsp)
> -       movq %rbp, 4*8+\offset(%rsp)
> -       movq %rbx, 5*8+\offset(%rsp)
> +       movq %r15, 1*8+\offset(%rsp)
> +       movq %r14, 2*8+\offset(%rsp)
> +       movq %r13, 3*8+\offset(%rsp)
> +       movq %r12, 4*8+\offset(%rsp)
> +       movq %rbp, 5*8+\offset(%rsp)
> +       movq %rbx, 6*8+\offset(%rsp)
>         .endm
>
>         .macro RESTORE_EXTRA_REGS offset=0
> -       movq 0*8+\offset(%rsp), %r15
> -       movq 1*8+\offset(%rsp), %r14
> -       movq 2*8+\offset(%rsp), %r13
> -       movq 3*8+\offset(%rsp), %r12
> -       movq 4*8+\offset(%rsp), %rbp
> -       movq 5*8+\offset(%rsp), %rbx
> +       movq 1*8+\offset(%rsp), %r15
> +       movq 2*8+\offset(%rsp), %r14
> +       movq 3*8+\offset(%rsp), %r13
> +       movq 4*8+\offset(%rsp), %r12
> +       movq 5*8+\offset(%rsp), %rbp
> +       movq 6*8+\offset(%rsp), %rbx
>         .endm
>
>         .macro RESTORE_C_REGS_HELPER rstor_rax=1, rstor_rcx=1, rstor_r11=1, rstor_r8910=1, rstor_rdx=1
>         .if \rstor_r11
> -       movq 6*8(%rsp), %r11
> +       movq 7*8(%rsp), %r11
>         .endif
>         .if \rstor_r8910
> -       movq 7*8(%rsp), %r10
> -       movq 8*8(%rsp), %r9
> -       movq 9*8(%rsp), %r8
> +       movq 8*8(%rsp), %r10
> +       movq 9*8(%rsp), %r9
> +       movq 10*8(%rsp), %r8
>         .endif
>         .if \rstor_rax
> -       movq 10*8(%rsp), %rax
> +       movq 11*8(%rsp), %rax
>         .endif
>         .if \rstor_rcx
> -       movq 11*8(%rsp), %rcx
> +       movq 12*8(%rsp), %rcx
>         .endif
>         .if \rstor_rdx
> -       movq 12*8(%rsp), %rdx
> +       movq 13*8(%rsp), %rdx
>         .endif
> -       movq 13*8(%rsp), %rsi
> -       movq 14*8(%rsp), %rdi
> +       movq 14*8(%rsp), %rsi
> +       movq 15*8(%rsp), %rdi
>         .endm
>         .macro RESTORE_C_REGS
>         RESTORE_C_REGS_HELPER 1,1,1,1,1
> @@ -185,7 +188,7 @@
>         .endm
>
>         .macro REMOVE_PT_GPREGS_FROM_STACK addskip=0
> -       subq $-(15*8+\addskip), %rsp
> +       subq $-(16*8+\addskip), %rsp
>         .endm
>
>         .macro icebp
> @@ -203,11 +206,7 @@
>   */
>  .macro ENCODE_FRAME_POINTER ptregs_offset=0
>  #ifdef CONFIG_FRAME_POINTER
> -       .if \ptregs_offset
> -               leaq \ptregs_offset(%rsp), %rbp
> -       .else
> -               mov %rsp, %rbp
> -       .endif
> +       leaq    8+\ptregs_offset(%rsp), %rbp
>         orq     $0x1, %rbp
>  #endif
>  .endm
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 5b21970..880bbb8 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -168,7 +168,7 @@ GLOBAL(entry_SYSCALL_64_after_swapgs)
>         pushq   %r9                             /* pt_regs->r9 */
>         pushq   %r10                            /* pt_regs->r10 */
>         pushq   %r11                            /* pt_regs->r11 */
> -       sub     $(6*8), %rsp                    /* pt_regs->bp, bx, r12-15 not saved */
> +       sub     $(7*8), %rsp                    /* pt_regs->bp, bx, r12-15 not saved */
>
>         /*
>          * If we need to do entry work or if we guess we'll need to do
> @@ -234,14 +234,14 @@ entry_SYSCALL_64_fastpath:
>         TRACE_IRQS_ON
>         ENABLE_INTERRUPTS(CLBR_NONE)
>         SAVE_EXTRA_REGS
> -       movq    %rsp, %rdi
> +       leaq    8(%rsp), %rdi
>         call    syscall_return_slowpath /* returns with IRQs disabled */
>         jmp     return_from_SYSCALL_64
>
>  entry_SYSCALL64_slow_path:
>         /* IRQs are off. */
>         SAVE_EXTRA_REGS
> -       movq    %rsp, %rdi
> +       leaq    8(%rsp), %rdi
>         call    do_syscall_64           /* returns with IRQs disabled */
>
>  return_from_SYSCALL_64:
> @@ -342,9 +342,9 @@ ENTRY(stub_ptregs_64)
>          * Called from fast path -- disable IRQs again, pop return address
>          * and jump to slow path
>          */
> +       popq    %rax
>         DISABLE_INTERRUPTS(CLBR_NONE)
>         TRACE_IRQS_OFF
> -       popq    %rax
>         jmp     entry_SYSCALL64_slow_path
>
>  1:
> @@ -409,13 +409,14 @@ END(__switch_to_asm)
>   */
>  ENTRY(ret_from_fork)
>         movq    %rax, %rdi
> +       subq    $8, %rsp
>         call    schedule_tail                   /* rdi: 'prev' task parameter */
>
>         testq   %rbx, %rbx                      /* from kernel_thread? */
>         jnz     1f                              /* kernel threads are uncommon */
>
>  2:
> -       movq    %rsp, %rdi
> +       leaq    8(%rsp), %rdi
>         call    syscall_return_slowpath /* returns with IRQs disabled */
>         TRACE_IRQS_ON                   /* user mode is traced as IRQS on */
>         SWAPGS
> @@ -494,10 +495,12 @@ END(irq_entries_start)
>          * a little cheaper to use a separate counter in the PDA (short of
>          * moving irq_enter into assembly, which would be too much work)
>          */
> -       movq    %rsp, %rdi
> +       movq    %rsp, %rax
> +       leaq    8(%rsp), %rdi
>         incl    PER_CPU_VAR(irq_count)
>         cmovzq  PER_CPU_VAR(irq_stack_ptr), %rsp
> -       pushq   %rdi
> +       sub     $8, %rsp
> +       pushq   %rax
>         /* We entered an interrupt context - irqs are off: */
>         TRACE_IRQS_OFF
>
> @@ -527,7 +530,7 @@ ret_from_intr:
>
>         /* Interrupt came from user space */
>  GLOBAL(retint_user)
> -       mov     %rsp,%rdi
> +       leaq    8(%rsp), %rdi
>         call    prepare_exit_to_usermode
>         TRACE_IRQS_IRETQ
>         SWAPGS
> @@ -774,7 +777,7 @@ ENTRY(\sym)
>         .endif
>         .endif
>
> -       movq    %rsp, %rdi                      /* pt_regs pointer */
> +       leaq    8(%rsp), %rdi                   /* pt_regs pointer */
>
>         .if \has_error_code
>         movq    ORIG_RAX(%rsp), %rsi            /* get error code */
> @@ -810,11 +813,11 @@ ENTRY(\sym)
>         call    error_entry
>
>
> -       movq    %rsp, %rdi                      /* pt_regs pointer */
> +       leaq    8(%rsp), %rdi                   /* pt_regs pointer */
>         call    sync_regs
> -       movq    %rax, %rsp                      /* switch stack */
> +       leaq    -8(%rax), %rsp                  /* switch stack */
>
> -       movq    %rsp, %rdi                      /* pt_regs pointer */
> +       movq    %rax, %rdi                      /* pt_regs pointer */
>
>         .if \has_error_code
>         movq    ORIG_RAX(%rsp), %rsi            /* get error code */
> @@ -895,6 +898,7 @@ ENTRY(do_softirq_own_stack)
>         mov     %rsp, %rbp
>         incl    PER_CPU_VAR(irq_count)
>         cmove   PER_CPU_VAR(irq_stack_ptr), %rsp
> +       sub     $8, %rsp
>         push    %rbp                            /* frame pointer backlink */
>         call    __do_softirq
>         leaveq
> @@ -924,10 +928,11 @@ ENTRY(xen_do_hypervisor_callback)         /* do_hypervisor_callback(struct *pt_regs) */
>   * Since we don't modify %rdi, evtchn_do_upall(struct *pt_regs) will
>   * see the correct pointer to the pt_regs
>   */
> -       movq    %rdi, %rsp                      /* we don't return, adjust the stack frame */
> +       leaq    -8(%rdi), %rsp                  /* we don't return, adjust the stack frame */
>  11:    incl    PER_CPU_VAR(irq_count)
>         movq    %rsp, %rbp
>         cmovzq  PER_CPU_VAR(irq_stack_ptr), %rsp
> +       subq    $8, %rsp
>         pushq   %rbp                            /* frame pointer backlink */
>         call    xen_evtchn_do_upcall
>         popq    %rsp
> @@ -1264,6 +1269,7 @@ ENTRY(nmi)
>          */
>
>         movq    %rsp, %rdi
> +       subq    $8, %rsp
>         movq    $-1, %rsi
>         call    do_nmi
>
> @@ -1475,7 +1481,7 @@ end_repeat_nmi:
>         call    paranoid_entry
>
>         /* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
> -       movq    %rsp, %rdi
> +       leaq    8(%rsp), %rdi
>         movq    $-1, %rsi
>         call    do_nmi
>
> @@ -1519,7 +1525,7 @@ ENTRY(rewind_stack_do_exit)
>         xorl    %ebp, %ebp
>
>         movq    PER_CPU_VAR(cpu_current_top_of_stack), %rax
> -       leaq    -TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%rax), %rsp
> +       leaq    -TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE-8(%rax), %rsp
>
>         call    do_exit
>  1:     jmp 1b
> diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
> index e1721da..7d3f1e3 100644
> --- a/arch/x86/entry/entry_64_compat.S
> +++ b/arch/x86/entry/entry_64_compat.S
> @@ -89,6 +89,8 @@ ENTRY(entry_SYSENTER_compat)
>         pushq   $0                      /* pt_regs->r13 = 0 */
>         pushq   $0                      /* pt_regs->r14 = 0 */
>         pushq   $0                      /* pt_regs->r15 = 0 */
> +
> +       subq    $8, %rsp
>         cld
>
>         /*
> @@ -120,7 +122,7 @@ ENTRY(entry_SYSENTER_compat)
>          */
>         TRACE_IRQS_OFF
>
> -       movq    %rsp, %rdi
> +       leaq    8(%rsp), %rdi
>         call    do_fast_syscall_32
>         /* XEN PV guests always use IRET path */
>         ALTERNATIVE "testl %eax, %eax; jz .Lsyscall_32_done", \
> @@ -215,13 +217,15 @@ ENTRY(entry_SYSCALL_compat)
>         pushq   $0                      /* pt_regs->r14 = 0 */
>         pushq   $0                      /* pt_regs->r15 = 0 */
>
> +       subq    $8, %rsp
> +
>         /*
>          * User mode is traced as though IRQs are on, and SYSENTER
>          * turned them off.
>          */
>         TRACE_IRQS_OFF
>
> -       movq    %rsp, %rdi
> +       leaq    8(%rsp), %rdi
>         call    do_fast_syscall_32
>         /* XEN PV guests always use IRET path */
>         ALTERNATIVE "testl %eax, %eax; jz .Lsyscall_32_done", \
> @@ -324,6 +328,8 @@ ENTRY(entry_INT80_compat)
>         pushq   %r13                    /* pt_regs->r13 */
>         pushq   %r14                    /* pt_regs->r14 */
>         pushq   %r15                    /* pt_regs->r15 */
> +
> +       subq    $8, %rsp
>         cld
>
>         /*
> @@ -332,7 +338,7 @@ ENTRY(entry_INT80_compat)
>          */
>         TRACE_IRQS_OFF
>
> -       movq    %rsp, %rdi
> +       leaq    8(%rsp), %rdi
>         call    do_int80_syscall_32
>  .Lsyscall_32_done:
>
> diff --git a/arch/x86/entry/thunk_64.S b/arch/x86/entry/thunk_64.S
> index be36bf4..3c80aac 100644
> --- a/arch/x86/entry/thunk_64.S
> +++ b/arch/x86/entry/thunk_64.S
> @@ -33,6 +33,7 @@
>         movq 8(%rbp), %rdi
>         .endif
>
> +       sub $8, %rsp
>         call \func
>         jmp  .L_restore
>         _ASM_NOKPROBE(\name)
> @@ -58,6 +59,7 @@
>   || defined(CONFIG_DEBUG_LOCK_ALLOC) \
>   || defined(CONFIG_PREEMPT)
>  .L_restore:
> +       add $8, %rsp
>         popq %r11
>         popq %r10
>         popq %r9
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index b467b14..d03ab72 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -384,6 +384,8 @@ early_idt_handler_common:
>         pushq %r14                              /* pt_regs->r14 */
>         pushq %r15                              /* pt_regs->r15 */
>
> +       sub $8, %rsp
> +
>         cmpq $14,%rsi           /* Page fault? */
>         jnz 10f
>         GET_CR2_INTO(%rdi)      /* Can clobber any volatile register if pv */
> @@ -392,7 +394,7 @@ early_idt_handler_common:
>         jz 20f                  /* All good */
>
>  10:
> -       movq %rsp,%rdi          /* RDI = pt_regs; RSI is already trapnr */
> +       leaq 8(%rsp), %rdi      /* RDI = pt_regs; RSI is already trapnr */
>         call early_fixup_exception
>
>  20:
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index bf0c6d0..2af9f81 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -590,6 +590,7 @@ asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
>
>  struct bad_iret_stack {
>         void *error_entry_ret;
> +       void *padding;
>         struct pt_regs regs;
>  };
>
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Andy Lutomirski Jan. 10, 2017, 7 p.m. UTC | #5
On Tue, Jan 10, 2017 at 9:30 AM, Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> On 10 January 2017 at 14:33, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>> I recently applied the patch
>>
>>         https://patchwork.kernel.org/patch/9468391/
>>
>> and ended up with a boot crash when it tried to run the x86 chacha20
>> code.  It turned out that the patch changed a manually aligned
>> stack buffer to one that is aligned by gcc.  What was happening was
>> that gcc can stack align to any value on x86-64 except 16.  The
>> reason is that gcc assumes that the stack is always 16-byte aligned,
>> which is not actually the case in the kernel.
>>
>
> Apologies for introducing this breakage. It seemed like an obvious and
> simple cleanup, so I didn't even bother to mention it in the commit
> log, but if the kernel does not guarantee 16 byte alignment, I guess
> we should revert to the old method. If SSE instructions are the only
> ones that require this alignment, then I suppose not having a ABI
> conforming stack pointer should not be an issue in general.

Here's what I think is really going on.  This is partially from
memory, so I could be off base.  The kernel is up against
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383, which means that,
on some GCC versions (like the bad one and maybe even current ones),
things compiled without -mno-sse can't have the stack alignment set
properly.  IMO we should fix this in the affected code, not the entry
code.  In fact, I think that fixing it in the entry code won't even
fully fix it because modern GCC will compile the rest of the kernel
with 8-byte alignment and the stack will get randomly unaligned (GCC
4.8 and newer).

Can we just add __attribute__((force_align_arg_pointer)) to the
affected functions?  Maybe have:

#define __USES_SSE __attribute__((force_align_arg_pointer))

on affected gcc versions?

***HOWEVER***

I think this is missing the tree for the supposed forest.  The actual
affected code appears to be:

static int chacha20_simd(struct blkcipher_desc *desc, struct scatterlist *dst,
                         struct scatterlist *src, unsigned int nbytes)
{
        u32 *state, state_buf[16 + (CHACHA20_STATE_ALIGN / sizeof(u32)) - 1];

...

        state = (u32 *)roundup((uintptr_t)state_buf, CHACHA20_STATE_ALIGN);

gcc presumably infers (incorrectly) that state_buf is 16-byte aligned
and optimizes out the roundup.  How about just declaring an actual
__aligned(16) buffer, marking the function
__attribute__((force_align_arg_pointer)), and being done with it?
After all, we need that forcible alignment on *all* gcc versions.

--Andy
Ard Biesheuvel Jan. 10, 2017, 7:16 p.m. UTC | #6
On 10 January 2017 at 19:00, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Jan 10, 2017 at 9:30 AM, Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
>> On 10 January 2017 at 14:33, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>>> I recently applied the patch
>>>
>>>         https://patchwork.kernel.org/patch/9468391/
>>>
>>> and ended up with a boot crash when it tried to run the x86 chacha20
>>> code.  It turned out that the patch changed a manually aligned
>>> stack buffer to one that is aligned by gcc.  What was happening was
>>> that gcc can stack align to any value on x86-64 except 16.  The
>>> reason is that gcc assumes that the stack is always 16-byte aligned,
>>> which is not actually the case in the kernel.
>>>
>>
>> Apologies for introducing this breakage. It seemed like an obvious and
>> simple cleanup, so I didn't even bother to mention it in the commit
>> log, but if the kernel does not guarantee 16 byte alignment, I guess
>> we should revert to the old method. If SSE instructions are the only
>> ones that require this alignment, then I suppose not having a ABI
>> conforming stack pointer should not be an issue in general.
>
> Here's what I think is really going on.  This is partially from
> memory, so I could be off base.  The kernel is up against
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383, which means that,
> on some GCC versions (like the bad one and maybe even current ones),
> things compiled without -mno-sse can't have the stack alignment set
> properly.  IMO we should fix this in the affected code, not the entry
> code.  In fact, I think that fixing it in the entry code won't even
> fully fix it because modern GCC will compile the rest of the kernel
> with 8-byte alignment and the stack will get randomly unaligned (GCC
> 4.8 and newer).
>
> Can we just add __attribute__((force_align_arg_pointer)) to the
> affected functions?  Maybe have:
>
> #define __USES_SSE __attribute__((force_align_arg_pointer))
>
> on affected gcc versions?
>
> ***HOWEVER***
>
> I think this is missing the tree for the supposed forest.  The actual
> affected code appears to be:
>
> static int chacha20_simd(struct blkcipher_desc *desc, struct scatterlist *dst,
>                          struct scatterlist *src, unsigned int nbytes)
> {
>         u32 *state, state_buf[16 + (CHACHA20_STATE_ALIGN / sizeof(u32)) - 1];
>
> ...
>
>         state = (u32 *)roundup((uintptr_t)state_buf, CHACHA20_STATE_ALIGN);
>
> gcc presumably infers (incorrectly) that state_buf is 16-byte aligned
> and optimizes out the roundup.  How about just declaring an actual
> __aligned(16) buffer, marking the function
> __attribute__((force_align_arg_pointer)), and being done with it?
> After all, we need that forcible alignment on *all* gcc versions.
>

Actually, the breakage is introduced by the patch Herbert refers to

https://patchwork.kernel.org/patch/9468391/

where the state is replaced by a simple

u32 state[16] __aligned(CHACHA20_STATE_ALIGN);

which seemed harmless enough to me. So the code above works fine.
Andy Lutomirski Jan. 10, 2017, 7:22 p.m. UTC | #7
On Tue, Jan 10, 2017 at 11:16 AM, Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> On 10 January 2017 at 19:00, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Tue, Jan 10, 2017 at 9:30 AM, Ard Biesheuvel
>> <ard.biesheuvel@linaro.org> wrote:
>>> On 10 January 2017 at 14:33, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>>>> I recently applied the patch
>>>>
>>>>         https://patchwork.kernel.org/patch/9468391/
>>>>
>>>> and ended up with a boot crash when it tried to run the x86 chacha20
>>>> code.  It turned out that the patch changed a manually aligned
>>>> stack buffer to one that is aligned by gcc.  What was happening was
>>>> that gcc can stack align to any value on x86-64 except 16.  The
>>>> reason is that gcc assumes that the stack is always 16-byte aligned,
>>>> which is not actually the case in the kernel.
>>>>
>>>
>>> Apologies for introducing this breakage. It seemed like an obvious and
>>> simple cleanup, so I didn't even bother to mention it in the commit
>>> log, but if the kernel does not guarantee 16 byte alignment, I guess
>>> we should revert to the old method. If SSE instructions are the only
>>> ones that require this alignment, then I suppose not having a ABI
>>> conforming stack pointer should not be an issue in general.
>>
>> Here's what I think is really going on.  This is partially from
>> memory, so I could be off base.  The kernel is up against
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383, which means that,
>> on some GCC versions (like the bad one and maybe even current ones),
>> things compiled without -mno-sse can't have the stack alignment set
>> properly.  IMO we should fix this in the affected code, not the entry
>> code.  In fact, I think that fixing it in the entry code won't even
>> fully fix it because modern GCC will compile the rest of the kernel
>> with 8-byte alignment and the stack will get randomly unaligned (GCC
>> 4.8 and newer).
>>
>> Can we just add __attribute__((force_align_arg_pointer)) to the
>> affected functions?  Maybe have:
>>
>> #define __USES_SSE __attribute__((force_align_arg_pointer))
>>
>> on affected gcc versions?
>>
>> ***HOWEVER***
>>
>> I think this is missing the tree for the supposed forest.  The actual
>> affected code appears to be:
>>
>> static int chacha20_simd(struct blkcipher_desc *desc, struct scatterlist *dst,
>>                          struct scatterlist *src, unsigned int nbytes)
>> {
>>         u32 *state, state_buf[16 + (CHACHA20_STATE_ALIGN / sizeof(u32)) - 1];
>>
>> ...
>>
>>         state = (u32 *)roundup((uintptr_t)state_buf, CHACHA20_STATE_ALIGN);
>>
>> gcc presumably infers (incorrectly) that state_buf is 16-byte aligned
>> and optimizes out the roundup.  How about just declaring an actual
>> __aligned(16) buffer, marking the function
>> __attribute__((force_align_arg_pointer)), and being done with it?
>> After all, we need that forcible alignment on *all* gcc versions.
>>
>
> Actually, the breakage is introduced by the patch Herbert refers to
>
> https://patchwork.kernel.org/patch/9468391/
>
> where the state is replaced by a simple
>
> u32 state[16] __aligned(CHACHA20_STATE_ALIGN);
>
> which seemed harmless enough to me. So the code above works fine.

So how about just the one-line patch of adding the
force_align_arg_pointer?  Would that solve the problem?
Ard Biesheuvel Jan. 10, 2017, 8 p.m. UTC | #8
On 10 January 2017 at 19:22, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Jan 10, 2017 at 11:16 AM, Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
>> On 10 January 2017 at 19:00, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Tue, Jan 10, 2017 at 9:30 AM, Ard Biesheuvel
>>> <ard.biesheuvel@linaro.org> wrote:
>>>> On 10 January 2017 at 14:33, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>>>>> I recently applied the patch
>>>>>
>>>>>         https://patchwork.kernel.org/patch/9468391/
>>>>>
>>>>> and ended up with a boot crash when it tried to run the x86 chacha20
>>>>> code.  It turned out that the patch changed a manually aligned
>>>>> stack buffer to one that is aligned by gcc.  What was happening was
>>>>> that gcc can stack align to any value on x86-64 except 16.  The
>>>>> reason is that gcc assumes that the stack is always 16-byte aligned,
>>>>> which is not actually the case in the kernel.
>>>>>
>>>>
>>>> Apologies for introducing this breakage. It seemed like an obvious and
>>>> simple cleanup, so I didn't even bother to mention it in the commit
>>>> log, but if the kernel does not guarantee 16 byte alignment, I guess
>>>> we should revert to the old method. If SSE instructions are the only
>>>> ones that require this alignment, then I suppose not having a ABI
>>>> conforming stack pointer should not be an issue in general.
>>>
>>> Here's what I think is really going on.  This is partially from
>>> memory, so I could be off base.  The kernel is up against
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383, which means that,
>>> on some GCC versions (like the bad one and maybe even current ones),
>>> things compiled without -mno-sse can't have the stack alignment set
>>> properly.  IMO we should fix this in the affected code, not the entry
>>> code.  In fact, I think that fixing it in the entry code won't even
>>> fully fix it because modern GCC will compile the rest of the kernel
>>> with 8-byte alignment and the stack will get randomly unaligned (GCC
>>> 4.8 and newer).
>>>
>>> Can we just add __attribute__((force_align_arg_pointer)) to the
>>> affected functions?  Maybe have:
>>>
>>> #define __USES_SSE __attribute__((force_align_arg_pointer))
>>>
>>> on affected gcc versions?
>>>
>>> ***HOWEVER***
>>>
>>> I think this is missing the tree for the supposed forest.  The actual
>>> affected code appears to be:
>>>
>>> static int chacha20_simd(struct blkcipher_desc *desc, struct scatterlist *dst,
>>>                          struct scatterlist *src, unsigned int nbytes)
>>> {
>>>         u32 *state, state_buf[16 + (CHACHA20_STATE_ALIGN / sizeof(u32)) - 1];
>>>
>>> ...
>>>
>>>         state = (u32 *)roundup((uintptr_t)state_buf, CHACHA20_STATE_ALIGN);
>>>
>>> gcc presumably infers (incorrectly) that state_buf is 16-byte aligned
>>> and optimizes out the roundup.  How about just declaring an actual
>>> __aligned(16) buffer, marking the function
>>> __attribute__((force_align_arg_pointer)), and being done with it?
>>> After all, we need that forcible alignment on *all* gcc versions.
>>>
>>
>> Actually, the breakage is introduced by the patch Herbert refers to
>>
>> https://patchwork.kernel.org/patch/9468391/
>>
>> where the state is replaced by a simple
>>
>> u32 state[16] __aligned(CHACHA20_STATE_ALIGN);
>>
>> which seemed harmless enough to me. So the code above works fine.
>
> So how about just the one-line patch of adding the
> force_align_arg_pointer?  Would that solve the problem?

If it does what it says on the tin, it should fix the issue, but after
adding the attribute, I get the exact same object output, so there's
something dodgy going on here.
Andy Lutomirski Jan. 10, 2017, 11:25 p.m. UTC | #9
On Tue, Jan 10, 2017 at 12:00 PM, Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> On 10 January 2017 at 19:22, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Tue, Jan 10, 2017 at 11:16 AM, Ard Biesheuvel
>> <ard.biesheuvel@linaro.org> wrote:
>>> On 10 January 2017 at 19:00, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Tue, Jan 10, 2017 at 9:30 AM, Ard Biesheuvel
>>>> <ard.biesheuvel@linaro.org> wrote:
>>>>> On 10 January 2017 at 14:33, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>>>>>> I recently applied the patch
>>>>>>
>>>>>>         https://patchwork.kernel.org/patch/9468391/
>>>>>>
>>>>>> and ended up with a boot crash when it tried to run the x86 chacha20
>>>>>> code.  It turned out that the patch changed a manually aligned
>>>>>> stack buffer to one that is aligned by gcc.  What was happening was
>>>>>> that gcc can stack align to any value on x86-64 except 16.  The
>>>>>> reason is that gcc assumes that the stack is always 16-byte aligned,
>>>>>> which is not actually the case in the kernel.
>>>>>>
>>>>>
>>>>> Apologies for introducing this breakage. It seemed like an obvious and
>>>>> simple cleanup, so I didn't even bother to mention it in the commit
>>>>> log, but if the kernel does not guarantee 16 byte alignment, I guess
>>>>> we should revert to the old method. If SSE instructions are the only
>>>>> ones that require this alignment, then I suppose not having an
>>>>> ABI-conforming stack pointer should not be an issue in general.
>>>>
>>>> Here's what I think is really going on.  This is partially from
>>>> memory, so I could be off base.  The kernel is up against
>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383, which means that,
>>>> on some GCC versions (like the bad one and maybe even current ones),
>>>> things compiled without -mno-sse can't have the stack alignment set
>>>> properly.  IMO we should fix this in the affected code, not the entry
>>>> code.  In fact, I think that fixing it in the entry code won't even
>>>> fully fix it because modern GCC will compile the rest of the kernel
>>>> with 8-byte alignment and the stack will get randomly unaligned (GCC
>>>> 4.8 and newer).
>>>>
>>>> Can we just add __attribute__((force_align_arg_pointer)) to the
>>>> affected functions?  Maybe have:
>>>>
>>>> #define __USES_SSE __attribute__((force_align_arg_pointer))
>>>>
>>>> on affected gcc versions?
>>>>
>>>> ***HOWEVER***
>>>>
>>>> I think this is missing the tree for the supposed forest.  The actual
>>>> affected code appears to be:
>>>>
>>>> static int chacha20_simd(struct blkcipher_desc *desc, struct scatterlist *dst,
>>>>                          struct scatterlist *src, unsigned int nbytes)
>>>> {
>>>>         u32 *state, state_buf[16 + (CHACHA20_STATE_ALIGN / sizeof(u32)) - 1];
>>>>
>>>> ...
>>>>
>>>>         state = (u32 *)roundup((uintptr_t)state_buf, CHACHA20_STATE_ALIGN);
>>>>
>>>> gcc presumably infers (incorrectly) that state_buf is 16-byte aligned
>>>> and optimizes out the roundup.  How about just declaring an actual
>>>> __aligned(16) buffer, marking the function
>>>> __attribute__((force_align_arg_pointer)), and being done with it?
>>>> After all, we need that forcible alignment on *all* gcc versions.
>>>>
>>>
>>> Actually, the breakage is introduced by the patch Herbert refers to
>>>
>>> https://patchwork.kernel.org/patch/9468391/
>>>
>>> where the state is replaced by a simple
>>>
>>> u32 state[16] __aligned(CHACHA20_STATE_ALIGN);
>>>
>>> which seemed harmless enough to me. So the code above works fine.
>>
>> So how about just the one-line patch of adding the
>> force_align_arg_pointer?  Would that solve the problem?
>
> If it does what it says on the tin, it should fix the issue, but after
> adding the attribute, I get the exact same object output, so there's
> something dodgy going on here.

Ugh, that's annoying.  Maybe it needs noinline too?

--Andy
Herbert Xu Jan. 11, 2017, 3:11 a.m. UTC | #10
On Tue, Jan 10, 2017 at 09:05:28AM -0800, Linus Torvalds wrote:
> On Tue, Jan 10, 2017 at 6:39 AM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> >
> > BTW this is with Debian gcc 4.7.2 which does not allow an 8-byte
> > stack alignment as attempted by the Makefile:
> 
> I'm pretty sure we have random asm code that may not maintain a
> 16-byte stack alignment when it calls other code (including, in some
> cases, calling C code).
> 
> So I'm not at all convinced that this is a good idea. We shouldn't
> expect 16-byte alignment to be something trustworthy.

Well the only other alternative I see is to ban compilers which
enforce 16-byte stack alignment, such as gcc 4.7.2.  Or is there
another way?

Cheers,
Herbert Xu Jan. 11, 2017, 3:15 a.m. UTC | #11
On Tue, Jan 10, 2017 at 05:30:48PM +0000, Ard Biesheuvel wrote:
> 
> Apologies for introducing this breakage. It seemed like an obvious and
> simple cleanup, so I didn't even bother to mention it in the commit
> log, but if the kernel does not guarantee 16 byte alignment, I guess
> we should revert to the old method. If SSE instructions are the only
> ones that require this alignment, then I suppose not having an
> ABI-conforming stack pointer should not be an issue in general.

I think we need to address this regardless of your patch.  You
won't be the last person to use __attribute__ to get 16-byte
alignment on the stack.

Cheers,
Herbert Xu Jan. 11, 2017, 3:16 a.m. UTC | #12
On Tue, Jan 10, 2017 at 11:00:31AM -0800, Andy Lutomirski wrote:
> 
> Here's what I think is really going on.  This is partially from
> memory, so I could be off base.  The kernel is up against
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383, which means that,
> on some GCC versions (like the bad one and maybe even current ones),
> things compiled without -mno-sse can't have the stack alignment set
> properly.  IMO we should fix this in the affected code, not the entry

No that's not it.  My compiler (gcc 4.7.2) doesn't support it period:

$ gcc -S -O2 -mno-sse -mpreferred-stack-boundary=3 a.c
a.c:1:0: error: -mpreferred-stack-boundary=3 is not between 4 and 12
$ 

So you either have to ban all compilers older than whatever version
that started supporting 8-byte stack alignment, or fix the kernel.

Cheers,
Herbert Xu Jan. 11, 2017, 3:26 a.m. UTC | #13
On Tue, Jan 10, 2017 at 11:22:15AM -0800, Andy Lutomirski wrote:
>
> > Actually, the breakage is introduced by the patch Herbert refers to
> >
> > https://patchwork.kernel.org/patch/9468391/
> >
> > where the state is replaced by a simple
> >
> > u32 state[16] __aligned(CHACHA20_STATE_ALIGN);
> >
> > which seemed harmless enough to me. So the code above works fine.
> 
> So how about just the one-line patch of adding the
> force_align_arg_pointer?  Would that solve the problem?

It probably does.  However, this is too error-prone.  Surely
you can't expect random kernel developers to know to add this
force_align_arg_pointer every time they try to align a stack
variable to 16 bytes?

Cheers,
Herbert Xu Jan. 11, 2017, 3:26 a.m. UTC | #14
On Tue, Jan 10, 2017 at 03:25:47PM -0800, Andy Lutomirski wrote:
>
> > If it does what it says on the tin, it should fix the issue, but after
> > adding the attribute, I get the exact same object output, so there's
> > something dodgy going on here.
> 
> Ugh, that's annoying.  Maybe it needs noinline too?

Perhaps something to do with

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66697

Cheers,
Linus Torvalds Jan. 11, 2017, 3:30 a.m. UTC | #15
On Tue, Jan 10, 2017 at 7:11 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> Well the only other alternative I see is to ban compilers which
> enforce 16-byte stack alignment, such as gcc 4.7.2.

No,  you don't have to ban the compiler - it's just a "generate overly
stupid code that just uses extra instructions to likely mis-align the
stack more" issue. So it's "stupid code generation" vs "buggy".

What we should ban is code that assumes that stack objects can be
aligned to more than word boundary.

__attribute__((align)) simply doesn't work on stack objects, because
the stack isn't aligned.

If you really want more stack alignment, you have to generate that
alignment yourself by hand (and have a bigger buffer that you do that
alignment inside).

So this was just simply buggy:

      u32 state[16] __aligned(CHACHA20_STATE_ALIGN);

because you just can't do that. It's that simple. There is a reason
why the code does the dance with

    u32 *state, state_buf[16 + (CHACHA20_STATE_ALIGN / sizeof(u32)) - 1];

    state = (u32 *)roundup((uintptr_t)state_buf, CHACHA20_STATE_ALIGN);

rather than ask the compiler to do something invalid.

                Linus
Linus Torvalds Jan. 11, 2017, 4:17 a.m. UTC | #16
On Tue, Jan 10, 2017 at 7:30 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> If you really want more stack alignment, you have to generate that
> alignment yourself by hand (and have a bigger buffer that you do that
> alignment inside).

Side note: gcc can (and does) actually generate forced alignment using
"and" instructions on %rsp rather than assuming pre-existing
alignment.  And that would be valid.

The problem with "alignof(16)" is not that gcc couldn't generate the
alignment itself, it's just the broken "it's already aligned to 16
bytes" assumption because -mpreferred-stack-boundary=3 doesn't work.

You *could* try to hack around it by forcing a 32-byte alignment
instead. That (I think) will make gcc generate the "and" instruction
mess.

And it shouldn't actually use any more memory than doing it by hand
(by having twice the alignment and hand-aligning the pointer).

So we *could* try to just have a really hacky rule saying that you can
align stack data to 8 or 32 bytes, but *not* to 16 bytes.

That said, I do think that the "don't assume stack alignment, do it by
hand" may be the safer thing. Because who knows what the random rules
will be on other architectures.

               Linus
Herbert Xu Jan. 11, 2017, 4:35 a.m. UTC | #17
On Tue, Jan 10, 2017 at 08:17:17PM -0800, Linus Torvalds wrote:
>
> That said, I do think that the "don't assume stack alignment, do it by
> hand" may be the safer thing. Because who knows what the random rules
> will be on other architectures.

Sure we can ban the use of attribute aligned on stacks.  But
what about indirect uses through structures?  For example, if
someone does

struct foo {
} __attribute__ ((__aligned__(16)));

int bar(...)
{
	struct foo f;

	return baz(&f);
}

then baz will end up with an unaligned argument.  The worst part
is that it is not at all obvious to the person writing the function
bar.

Cheers,
Andy Lutomirski Jan. 11, 2017, 6:01 a.m. UTC | #18
On Tue, Jan 10, 2017 at 8:35 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Tue, Jan 10, 2017 at 08:17:17PM -0800, Linus Torvalds wrote:
>>
>> That said, I do think that the "don't assume stack alignment, do it by
>> hand" may be the safer thing. Because who knows what the random rules
>> will be on other architectures.
>
> Sure we can ban the use of attribute aligned on stacks.  But
> what about indirect uses through structures?  For example, if
> someone does
>
> struct foo {
> } __attribute__ ((__aligned__(16)));
>
> int bar(...)
> {
>         struct foo f;
>
>         return baz(&f);
> }
>
> then baz will end up with an unaligned argument.  The worst part
> is that it is not at all obvious to the person writing the function
> bar.

Linus, I'm starting to lean toward agreeing with Herbert here, except
that we should consider making it conditional on having a silly GCC
version.  After all, the silly GCC versions are wasting space and time
with alignment instructions no matter what we do, so this would just
mean tweaking the asm and adding some kind of check_stack_alignment()
helper to throw out a WARN_ONCE() if we miss one.  The problem with
making it conditional is that making pt_regs effectively live at a
variable offset from %rsp is just nasty.

--Andy
Ard Biesheuvel Jan. 11, 2017, 8:06 a.m. UTC | #19
On 11 January 2017 at 06:53, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> On Jan 10, 2017 8:36 PM, "Herbert Xu" <herbert@gondor.apana.org.au> wrote:
>
>
> Sure we can ban the use of attribute aligned on stacks.  But
> what about indirect uses through structures?
>
>
> It should be pretty trivial to add a sparse warning for that, though.
>

Couldn't we update the __aligned(x) macro to emit 32 if arch == x86
and x == 16? All other cases should work just fine afaict
Herbert Xu Jan. 11, 2017, 8:09 a.m. UTC | #20
On Wed, Jan 11, 2017 at 08:06:54AM +0000, Ard Biesheuvel wrote:
>
> Couldn't we update the __aligned(x) macro to emit 32 if arch == x86
> and x == 16? All other cases should work just fine afaict

Not everyone uses that macro.  You'd also need to add some checks
to stop people from using the gcc __attribute__ directly.

Cheers,
Andy Lutomirski Jan. 11, 2017, 6:20 p.m. UTC | #21
On Wed, Jan 11, 2017 at 12:09 AM, Herbert Xu
<herbert@gondor.apana.org.au> wrote:
> On Wed, Jan 11, 2017 at 08:06:54AM +0000, Ard Biesheuvel wrote:
>>
>> Couldn't we update the __aligned(x) macro to emit 32 if arch == x86
>> and x == 16? All other cases should work just fine afaict
>
> Not everyone uses that macro.  You'd also need to add some checks
> to stop people from using the gcc __attribute__ directly.
>

You'd also have to audit things to make sure that __aligned__(16)
isn't being used for non-stack purposes.  After all, __aligned__(16)
in static data is fine, and it's also fine as a promise to GCC that
some object is 16-byte aligned.

--Andy
Herbert Xu Jan. 12, 2017, 6:12 a.m. UTC | #22
On Tue, Jan 10, 2017 at 05:30:48PM +0000, Ard Biesheuvel wrote:
>
> Apologies for introducing this breakage. It seemed like an obvious and
> simple cleanup, so I didn't even bother to mention it in the commit
> log, but if the kernel does not guarantee 16 byte alignment, I guess
> we should revert to the old method. If SSE instructions are the only
> ones that require this alignment, then I suppose not having an
> ABI-conforming stack pointer should not be an issue in general.

BTW Ard, what is the stack alignment on ARM64?

Cheers,
Andy Lutomirski Jan. 12, 2017, 6:21 a.m. UTC | #23
On Tue, Jan 10, 2017 at 10:01 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Jan 10, 2017 at 8:35 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>> On Tue, Jan 10, 2017 at 08:17:17PM -0800, Linus Torvalds wrote:
>>>
>>> That said, I do think that the "don't assume stack alignment, do it by
>>> hand" may be the safer thing. Because who knows what the random rules
>>> will be on other architectures.
>>
>> Sure we can ban the use of attribute aligned on stacks.  But
>> what about indirect uses through structures?  For example, if
>> someone does
>>
>> struct foo {
>> } __attribute__ ((__aligned__(16)));
>>
>> int bar(...)
>> {
>>         struct foo f;
>>
>>         return baz(&f);
>> }
>>
>> then baz will end up with an unaligned argument.  The worst part
>> is that it is not at all obvious to the person writing the function
>> bar.
>
> Linus, I'm starting to lean toward agreeing with Herbert here, except
> that we should consider making it conditional on having a silly GCC
> version.  After all, the silly GCC versions are wasting space and time
> with alignment instructions no matter what we do, so this would just
> mean tweaking the asm and adding some kind of check_stack_alignment()
> helper to throw out a WARN_ONCE() if we miss one.  The problem with
> making it conditional is that making pt_regs effectively live at a
> variable offset from %rsp is just nasty.

So actually doing this is gross because we have calls from asm to C
all over the place.  But... maybe we can automate all the testing.
Josh, how hard would it be to teach objtool to (if requested by an
option) check that stack frames with statically known size preserve
16-byte stack alignment?

I find it rather annoying that gcc before 4.8 malfunctions when it
sees __aligned__(16) on x86_64 kernels.  Sigh.

--Andy
Herbert Xu Jan. 12, 2017, 7:05 a.m. UTC | #24
On Tue, Jan 10, 2017 at 09:05:28AM -0800, Linus Torvalds wrote:
>
> I'm pretty sure we have random asm code that may not maintain a
> 16-byte stack alignment when it calls other code (including, in some
> cases, calling C code).
> 
> So I'm not at all convinced that this is a good idea. We shouldn't
> expect 16-byte alignment to be something trustworthy.

So what if we audited all the x86 assembly code to fix this? Would
it then be acceptable to do a 16-byte aligned stack?

On the face of it it doesn't seem to be a huge amount of code
assuming they mostly live under arch/x86.

Cheers,
Ingo Molnar Jan. 12, 2017, 7:40 a.m. UTC | #25
* Andy Lutomirski <luto@amacapital.net> wrote:

> I find it rather annoying that gcc before 4.8 malfunctions when it
> sees __aligned__(16) on x86_64 kernels.  Sigh.

Ran into this when writing silly FPU in-kernel testcases a couple of months ago...

Thanks,

	Ingo
Ingo Molnar Jan. 12, 2017, 7:46 a.m. UTC | #26
* Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Tue, Jan 10, 2017 at 09:05:28AM -0800, Linus Torvalds wrote:
> >
> > I'm pretty sure we have random asm code that may not maintain a
> > 16-byte stack alignment when it calls other code (including, in some
> > cases, calling C code).
> > 
> > So I'm not at all convinced that this is a good idea. We shouldn't
> > expect 16-byte alignment to be something trustworthy.
> 
> So what if we audited all the x86 assembly code to fix this? Would
> it then be acceptable to do a 16-byte aligned stack?

Audits for small but deadly details that isn't checked automatically by tooling 
would inevitably bitrot again - and in this particular case there's a 50% chance 
that a new, buggy change would test out to be 'fine' on a kernel developer's own 
box - and break on different configs, different hw or with unrelated (and 
innocent) kernel changes, sometime later - spreading the pain unnecessarily.

So my feeling is that we really need improved tooling for this (and yes, the GCC 
toolchain should have handled this correctly).

But fortunately we have related tooling in the kernel: could objtool handle this? 
My secret hope was always that objtool would grow into a kind of life insurance 
against toolchain bogosities (which is a must for things like livepatching or a 
DWARF unwinder - but I digress).

Thanks,

	Ingo
Andy Lutomirski Jan. 12, 2017, 7:51 a.m. UTC | #27
On Wed, Jan 11, 2017 at 11:05 PM, Herbert Xu
<herbert@gondor.apana.org.au> wrote:
> On Tue, Jan 10, 2017 at 09:05:28AM -0800, Linus Torvalds wrote:
>>
>> I'm pretty sure we have random asm code that may not maintain a
>> 16-byte stack alignment when it calls other code (including, in some
>> cases, calling C code).
>>
>> So I'm not at all convinced that this is a good idea. We shouldn't
>> expect 16-byte alignment to be something trustworthy.
>
> So what if we audited all the x86 assembly code to fix this? Would
> it then be acceptable to do a 16-byte aligned stack?
>
> On the face of it it doesn't seem to be a huge amount of code
> assuming they mostly live under arch/x86.

The problem is that we have nasties like TRACE_IRQS_OFF.  Performance
doesn't really matter for these macros, so we could probably rig up a
helper for forcibly align the stack there.  Maybe
FRAME_BEGIN_FORCE_ALIGN?  I also think I'd rather not to modify
pt_regs.  We should just fix the small number of code paths that
create a pt_regs and then call into C code to align the stack.

But if we can't do this with automatic verification, then I'm not sure
I want to do it at all.  The asm is already more precarious than I'd
like, and having a code path that is misaligned is asking for obscure
bugs down the road.

--Andy
Ard Biesheuvel Jan. 12, 2017, 8:01 a.m. UTC | #28
On 12 January 2017 at 06:12, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Tue, Jan 10, 2017 at 05:30:48PM +0000, Ard Biesheuvel wrote:
>>
>> Apologies for introducing this breakage. It seemed like an obvious and
>> simple cleanup, so I didn't even bother to mention it in the commit
>> log, but if the kernel does not guarantee 16 byte alignment, I guess
>> we should revert to the old method. If SSE instructions are the only
>> ones that require this alignment, then I suppose not having an
>> ABI-conforming stack pointer should not be an issue in general.
>
> BTW Ard, what is the stack alignment on ARM64?
>

[From memory] the arm64 ELF psABI mandates a 16 byte stack alignment
at function entry, and 8 byte alignment at all other times. This means
compiled code will typically preserve 16 byte alignment, and
__aligned(16) on a stack variable will likely not result in an
explicit alignment of the stack pointer *. But the arm64 ISA does not
have any load/store instructions that would trigger a misalignment
fault on an address that is 8 byte aligned but not 16 byte aligned, so
the situation is very different from x86 (assuming I am correct in
asserting that there are no failure modes resulting from a misaligned
stack other than this one and a potential performance hit)

* I didn't check whether the exception handling realigns the stack
pointer on nested exceptions (arm64 has separate IRQ stacks)
Herbert Xu Jan. 12, 2017, 8:04 a.m. UTC | #29
On Wed, Jan 11, 2017 at 11:51:10PM -0800, Andy Lutomirski wrote:
>
> The problem is that we have nasties like TRACE_IRQS_OFF.  Performance

I don't understand.  What's the issue with TRACE_IRQS_OFF? It should
be treated as any other function call.  That is, enter it with an
aligned stack, and the TRACE_IRQS_OFF code itself should ensure
the stack stays aligned before it calls down into C.
 
> But if we can't do this with automatic verification, then I'm not sure
> I want to do it at all.  The asm is already more precarious than I'd
> like, and having a code path that is misaligned is asking for obscure
> bugs down the road.

I understand the need for automated checks at this point in time.
But longer term this is just part of the calling ABI.  After all,
we don't add checks everywhere to ensure people preserve rbx.

Cheers,
Herbert Xu Jan. 12, 2017, 8:06 a.m. UTC | #30
On Thu, Jan 12, 2017 at 08:01:51AM +0000, Ard Biesheuvel wrote:
> 
> [From memory] the arm64 ELF psABI mandates a 16 byte stack alignment
> at function entry, and 8 byte alignment at all other times. This means
> compiled code will typically preserve 16 byte alignment, and
> __aligned(16) on a stack variable will likely not result in an
> explicit alignment of the stack pointer *. But the arm64 ISA does not
> have any load/store instructions that would trigger a misalignment
> fault on an address that is 8 byte aligned but not 16 byte aligned, so
> the situation is very different from x86 (assuming I am correct in
> asserting that there are no failure modes resulting from a misaligned
> stack other than this one and a potential performance hit)

OK, sounds like we're already using 16-byte aligned stacks on
ARM64.  So unless x86-64 stacks are much smaller, I don't see
the need to use 8-byte aligned stacks at least from a stack space
point-of-view.

Thanks,
Ingo Molnar Jan. 12, 2017, 8:18 a.m. UTC | #31
* Herbert Xu <herbert@gondor.apana.org.au> wrote:

> > But if we can't do this with automatic verification, then I'm not sure
> > I want to do it at all.  The asm is already more precarious than I'd
> > like, and having a code path that is misaligned is asking for obscure
> > bugs down the road.
> 
> I understand the need for automated checks at this point in time.
>  But longer term this is just part of the calling ABI.  After all,
> we don't add checks everywhere to ensure people preserve rbx.

The intelligent and responsible way to introduce such post facto ABI changes is 
via a smarter assembler: which would detect the obvious cases where assembly code 
generates a misaligned stack, at build time.

Assembly code can obviously still mess up in a hard to detect fashion if it tries 
- but that's OK, as most assembly code doesn't try to go outside regular stack 
allocation patterns.

Such a static check is relatively straightforward to do in assembly tooling - and 
perhaps objtool could do this too, as it already tracks the instructions that 
change the stack offset.

( And yes, this is what the GCC guys should have done, instead of sloppily 
  introducing such silent breakages and making the whole application landscape 
  less robust ... )

Thanks,

	Ingo
Josh Poimboeuf Jan. 12, 2017, 2:02 p.m. UTC | #32
On Wed, Jan 11, 2017 at 10:21:07PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 10, 2017 at 10:01 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> > On Tue, Jan 10, 2017 at 8:35 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> >> On Tue, Jan 10, 2017 at 08:17:17PM -0800, Linus Torvalds wrote:
> >>>
> >>> That said, I do think that the "don't assume stack alignment, do it by
> >>> hand" may be the safer thing. Because who knows what the random rules
> >>> will be on other architectures.
> >>
> >> Sure we can ban the use of attribute aligned on stacks.  But
> >> what about indirect uses through structures?  For example, if
> >> someone does
> >>
> >> struct foo {
> >> } __attribute__ ((__aligned__(16)));
> >>
> >> int bar(...)
> >> {
> >>         struct foo f;
> >>
> >>         return baz(&f);
> >> }
> >>
> >> then baz will end up with an unaligned argument.  The worst part
> >> is that it is not at all obvious to the person writing the function
> >> bar.
> >
> > Linus, I'm starting to lean toward agreeing with Herbert here, except
> > that we should consider making it conditional on having a silly GCC
> > version.  After all, the silly GCC versions are wasting space and time
> > with alignment instructions no matter what we do, so this would just
> > mean tweaking the asm and adding some kind of check_stack_alignment()
> > helper to throw out a WARN_ONCE() if we miss one.  The problem with
> > making it conditional is that making pt_regs effectively live at a
> > variable offset from %rsp is just nasty.
> 
> So actually doing this is gross because we have calls from asm to C
> all over the place.  But... maybe we can automate all the testing.
> Josh, how hard would it be to teach objtool to (if requested by an
> option) check that stack frames with statically known size preserve
> 16-byte stack alignment?
> 
> I find it rather annoying that gcc before 4.8 malfunctions when it
> sees __aligned__(16) on x86_64 kernels.  Sigh.

Just to clarify, I think you're asking if, for versions of gcc which
don't support -mpreferred-stack-boundary=3, objtool can analyze all C
functions to ensure their stacks are 16-byte aligned.

It's certainly possible, but I don't see how that solves the problem.
The stack will still be misaligned by entry code.  Or am I missing
something?
Josh Poimboeuf Jan. 12, 2017, 2:49 p.m. UTC | #33
On Thu, Jan 12, 2017 at 08:46:01AM +0100, Ingo Molnar wrote:
> 
> * Herbert Xu <herbert@gondor.apana.org.au> wrote:
> 
> > On Tue, Jan 10, 2017 at 09:05:28AM -0800, Linus Torvalds wrote:
> > >
> > > I'm pretty sure we have random asm code that may not maintain a
> > > 16-byte stack alignment when it calls other code (including, in some
> > > cases, calling C code).
> > > 
> > > So I'm not at all convinced that this is a good idea. We shouldn't
> > > expect 16-byte alignment to be something trustworthy.
> > 
> > So what if we audited all the x86 assembly code to fix this? Would
> > it then be acceptable to do a 16-byte aligned stack?
> 
> Audits for small but deadly details that aren't checked automatically by tooling 
> would inevitably bitrot again - and in this particular case there's a 50% chance 
> that a new, buggy change would test out to be 'fine' on a kernel developer's own 
> box - and break on different configs, different hw or with unrelated (and 
> innocent) kernel changes, sometime later - spreading the pain unnecessarily.
> 
> So my feeling is that we really need improved tooling for this (and yes, the GCC 
> toolchain should have handled this correctly).
> 
> But fortunately we have related tooling in the kernel: could objtool handle this? 
> My secret hope was always that objtool would grow into a kind of life insurance 
> against toolchain bogosities (which is a must for things like livepatching or a 
> DWARF unwinder - but I digress).

Are we talking about entry code, or other asm code?  Because objtool
audits *most* asm code, but entry code is way too specialized for
objtool to understand.

(I do have a pending objtool rewrite which would make it very easy to
ensure 16-byte stack alignment.  But again, objtool can only understand
callable C or asm functions, not entry code.)

Another approach would be to solve this problem with unwinder warnings,
*if* there's enough test coverage.

I recently made some changes to try to standardize the "end" of the
stack, so that the stack pointer is always a certain value before
calling into C code.  I also added some warnings to the unwinder to
ensure that it always reaches that point on the stack.  So if the "end"
of the stack were adjusted by a word by adding padding to pt_regs, the
unwinder warnings could help preserve that.

We could take that a step further by adding an unwinder check to ensure
that *every* frame is 16-byte aligned if -mpreferred-stack-boundary=3
isn't used.

Yet another step would be to add a debug feature which does stack sanity
checking from a periodic NMI, to flush out these unwinder warnings.

(Though I've found that current 0-day and fuzzing efforts, combined with
lockdep and perf's frequent unwinder usage, are already doing a great
job at flushing out unwinder warnings.)

The only question is if there would be enough test coverage,
particularly with those versions of gcc which don't have
-mpreferred-stack-boundary=3.
Josh Poimboeuf Jan. 12, 2017, 3:03 p.m. UTC | #34
On Wed, Jan 11, 2017 at 11:51:10PM -0800, Andy Lutomirski wrote:
> On Wed, Jan 11, 2017 at 11:05 PM, Herbert Xu
> <herbert@gondor.apana.org.au> wrote:
> > On Tue, Jan 10, 2017 at 09:05:28AM -0800, Linus Torvalds wrote:
> >>
> >> I'm pretty sure we have random asm code that may not maintain a
> >> 16-byte stack alignment when it calls other code (including, in some
> >> cases, calling C code).
> >>
> >> So I'm not at all convinced that this is a good idea. We shouldn't
> >> expect 16-byte alignment to be something trustworthy.
> >
> > So what if we audited all the x86 assembly code to fix this? Would
> > it then be acceptable to do a 16-byte aligned stack?
> >
> > On the face of it it doesn't seem to be a huge amount of code
> > assuming they mostly live under arch/x86.
> 
> The problem is that we have nasties like TRACE_IRQS_OFF.  Performance
> doesn't really matter for these macros, so we could probably rig up a
> helper to forcibly align the stack there.  Maybe
> FRAME_BEGIN_FORCE_ALIGN?  I also think I'd rather not modify
> pt_regs.  We should just fix the small number of code paths that
> create a pt_regs and then call into C code to align the stack.
> 
> But if we can't do this with automatic verification, then I'm not sure
> I want to do it at all.  The asm is already more precarious than I'd
> like, and having a code path that is misaligned is asking for obscure
> bugs down the road.

For the entry code, could we just replace all calls with CALL_ALIGNED?
That might be less intrusive than trying to adjust all the pt_regs
accesses.

Then to ensure that nobody ever uses 'call' directly:

  '#define call please-use-CALL-ALIGNED-instead-of-call'

I think that would be possible if CALL_ALIGNED were a ".macro".
Herbert Xu Jan. 12, 2017, 3:06 p.m. UTC | #35
On Thu, Jan 12, 2017 at 09:03:18AM -0600, Josh Poimboeuf wrote:
> 
> For the entry code, could we just replace all calls with CALL_ALIGNED?
> That might be less intrusive than trying to adjust all the pt_regs
> accesses.
> 
> Then to ensure that nobody ever uses 'call' directly:
> 
>   '#define call please-use-CALL-ALIGNED-instead-of-call'
> 
> I think that would be possible if CALL_ALIGNED were a ".macro".

The trouble with that is that you have got things like TRACE_IRQS_OFF
which are also used outside of the entry code.

Cheers,
Josh Poimboeuf Jan. 12, 2017, 3:10 p.m. UTC | #36
On Thu, Jan 12, 2017 at 09:03:18AM -0600, Josh Poimboeuf wrote:
> On Wed, Jan 11, 2017 at 11:51:10PM -0800, Andy Lutomirski wrote:
> > On Wed, Jan 11, 2017 at 11:05 PM, Herbert Xu
> > <herbert@gondor.apana.org.au> wrote:
> > > On Tue, Jan 10, 2017 at 09:05:28AM -0800, Linus Torvalds wrote:
> > >>
> > >> I'm pretty sure we have random asm code that may not maintain a
> > >> 16-byte stack alignment when it calls other code (including, in some
> > >> cases, calling C code).
> > >>
> > >> So I'm not at all convinced that this is a good idea. We shouldn't
> > >> expect 16-byte alignment to be something trustworthy.
> > >
> > > So what if we audited all the x86 assembly code to fix this? Would
> > > it then be acceptable to do a 16-byte aligned stack?
> > >
> > > On the face of it it doesn't seem to be a huge amount of code
> > > assuming they mostly live under arch/x86.
> > 
> > The problem is that we have nasties like TRACE_IRQS_OFF.  Performance
> > doesn't really matter for these macros, so we could probably rig up a
> > helper to forcibly align the stack there.  Maybe
> > FRAME_BEGIN_FORCE_ALIGN?  I also think I'd rather not modify
> > pt_regs.  We should just fix the small number of code paths that
> > create a pt_regs and then call into C code to align the stack.
> > 
> > But if we can't do this with automatic verification, then I'm not sure
> > I want to do it at all.  The asm is already more precarious than I'd
> > like, and having a code path that is misaligned is asking for obscure
> > bugs down the road.
> 
> For the entry code, could we just replace all calls with CALL_ALIGNED?
> That might be less intrusive than trying to adjust all the pt_regs
> accesses.
> 
> Then to ensure that nobody ever uses 'call' directly:
> 
>   '#define call please-use-CALL-ALIGNED-instead-of-call'
> 
> I think that would be possible if CALL_ALIGNED were a ".macro".

To clarify, CALL_ALIGNED could be (completely untested):

.macro CALL_ALIGNED func
	push	%rbp
	movq	%rsp, %rbp
	and	$0xfffffffffffffff0, %rsp	/* round %rsp down to 16 bytes */
	call	\func
	movq	%rbp, %rsp			/* restore the caller's %rsp */
	pop	%rbp
.endm
Josh Poimboeuf Jan. 12, 2017, 3:18 p.m. UTC | #37
On Thu, Jan 12, 2017 at 11:06:50PM +0800, Herbert Xu wrote:
> On Thu, Jan 12, 2017 at 09:03:18AM -0600, Josh Poimboeuf wrote:
> > 
> > For the entry code, could we just replace all calls with CALL_ALIGNED?
> > That might be less intrusive than trying to adjust all the pt_regs
> > accesses.
> > 
> > Then to ensure that nobody ever uses 'call' directly:
> > 
> >   '#define call please-use-CALL-ALIGNED-instead-of-call'
> > 
> > I think that would be possible if CALL_ALIGNED were a ".macro".
> 
> The trouble with that is that you have got things like TRACE_IRQS_OFF
> which are also used outside of the entry code.

Where?  As far as I can tell, TRACE_IRQS_OFF is used exclusively by entry
code.

Patch
diff mbox

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 05ed3d3..29d3bcb 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -59,39 +59,42 @@ 
 /*
  * C ABI says these regs are callee-preserved. They aren't saved on kernel entry
  * unless syscall needs a complete, fully filled "struct pt_regs".
+ *
+ * Note we add 8 extra bytes at the beginning to preserve stack alignment.
  */
-#define R15		0*8
-#define R14		1*8
-#define R13		2*8
-#define R12		3*8
-#define RBP		4*8
-#define RBX		5*8
+#define R15		1*8
+#define R14		2*8
+#define R13		3*8
+#define R12		4*8
+#define RBP		5*8
+#define RBX		6*8
 /* These regs are callee-clobbered. Always saved on kernel entry. */
-#define R11		6*8
-#define R10		7*8
-#define R9		8*8
-#define R8		9*8
-#define RAX		10*8
-#define RCX		11*8
-#define RDX		12*8
-#define RSI		13*8
-#define RDI		14*8
+#define R11		7*8
+#define R10		8*8
+#define R9		9*8
+#define R8		10*8
+#define RAX		11*8
+#define RCX		12*8
+#define RDX		13*8
+#define RSI		14*8
+#define RDI		15*8
 /*
  * On syscall entry, this is syscall#. On CPU exception, this is error code.
  * On hw interrupt, it's IRQ number:
  */
-#define ORIG_RAX	15*8
+#define ORIG_RAX	16*8
 /* Return frame for iretq */
-#define RIP		16*8
-#define CS		17*8
-#define EFLAGS		18*8
-#define RSP		19*8
-#define SS		20*8
+#define RIP		17*8
+#define CS		18*8
+#define EFLAGS		19*8
+#define RSP		20*8
+#define SS		21*8
 
+/* Note that this excludes the 8-byte padding. */
 #define SIZEOF_PTREGS	21*8
 
 	.macro ALLOC_PT_GPREGS_ON_STACK
-	addq	$-(15*8), %rsp
+	addq	$-(16*8), %rsp
 	.endm
 
 	.macro SAVE_C_REGS_HELPER offset=0 rax=1 rcx=1 r8910=1 r11=1
@@ -114,7 +117,7 @@ 
 	movq %rdi, 14*8+\offset(%rsp)
 	.endm
 	.macro SAVE_C_REGS offset=0
-	SAVE_C_REGS_HELPER \offset, 1, 1, 1, 1
+	SAVE_C_REGS_HELPER 8+\offset, 1, 1, 1, 1
 	.endm
 	.macro SAVE_C_REGS_EXCEPT_RAX_RCX offset=0
 	SAVE_C_REGS_HELPER \offset, 0, 0, 1, 1
@@ -130,43 +133,43 @@ 
 	.endm
 
 	.macro SAVE_EXTRA_REGS offset=0
-	movq %r15, 0*8+\offset(%rsp)
-	movq %r14, 1*8+\offset(%rsp)
-	movq %r13, 2*8+\offset(%rsp)
-	movq %r12, 3*8+\offset(%rsp)
-	movq %rbp, 4*8+\offset(%rsp)
-	movq %rbx, 5*8+\offset(%rsp)
+	movq %r15, 1*8+\offset(%rsp)
+	movq %r14, 2*8+\offset(%rsp)
+	movq %r13, 3*8+\offset(%rsp)
+	movq %r12, 4*8+\offset(%rsp)
+	movq %rbp, 5*8+\offset(%rsp)
+	movq %rbx, 6*8+\offset(%rsp)
 	.endm
 
 	.macro RESTORE_EXTRA_REGS offset=0
-	movq 0*8+\offset(%rsp), %r15
-	movq 1*8+\offset(%rsp), %r14
-	movq 2*8+\offset(%rsp), %r13
-	movq 3*8+\offset(%rsp), %r12
-	movq 4*8+\offset(%rsp), %rbp
-	movq 5*8+\offset(%rsp), %rbx
+	movq 1*8+\offset(%rsp), %r15
+	movq 2*8+\offset(%rsp), %r14
+	movq 3*8+\offset(%rsp), %r13
+	movq 4*8+\offset(%rsp), %r12
+	movq 5*8+\offset(%rsp), %rbp
+	movq 6*8+\offset(%rsp), %rbx
 	.endm
 
 	.macro RESTORE_C_REGS_HELPER rstor_rax=1, rstor_rcx=1, rstor_r11=1, rstor_r8910=1, rstor_rdx=1
 	.if \rstor_r11
-	movq 6*8(%rsp), %r11
+	movq 7*8(%rsp), %r11
 	.endif
 	.if \rstor_r8910
-	movq 7*8(%rsp), %r10
-	movq 8*8(%rsp), %r9
-	movq 9*8(%rsp), %r8
+	movq 8*8(%rsp), %r10
+	movq 9*8(%rsp), %r9
+	movq 10*8(%rsp), %r8
 	.endif
 	.if \rstor_rax
-	movq 10*8(%rsp), %rax
+	movq 11*8(%rsp), %rax
 	.endif
 	.if \rstor_rcx
-	movq 11*8(%rsp), %rcx
+	movq 12*8(%rsp), %rcx
 	.endif
 	.if \rstor_rdx
-	movq 12*8(%rsp), %rdx
+	movq 13*8(%rsp), %rdx
 	.endif
-	movq 13*8(%rsp), %rsi
-	movq 14*8(%rsp), %rdi
+	movq 14*8(%rsp), %rsi
+	movq 15*8(%rsp), %rdi
 	.endm
 	.macro RESTORE_C_REGS
 	RESTORE_C_REGS_HELPER 1,1,1,1,1
@@ -185,7 +188,7 @@ 
 	.endm
 
 	.macro REMOVE_PT_GPREGS_FROM_STACK addskip=0
-	subq $-(15*8+\addskip), %rsp
+	subq $-(16*8+\addskip), %rsp
 	.endm
 
 	.macro icebp
@@ -203,11 +206,7 @@ 
  */
 .macro ENCODE_FRAME_POINTER ptregs_offset=0
 #ifdef CONFIG_FRAME_POINTER
-	.if \ptregs_offset
-		leaq \ptregs_offset(%rsp), %rbp
-	.else
-		mov %rsp, %rbp
-	.endif
+	leaq	8+\ptregs_offset(%rsp), %rbp
 	orq	$0x1, %rbp
 #endif
 .endm
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 5b21970..880bbb8 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -168,7 +168,7 @@  GLOBAL(entry_SYSCALL_64_after_swapgs)
 	pushq	%r9				/* pt_regs->r9 */
 	pushq	%r10				/* pt_regs->r10 */
 	pushq	%r11				/* pt_regs->r11 */
-	sub	$(6*8), %rsp			/* pt_regs->bp, bx, r12-15 not saved */
+	sub	$(7*8), %rsp			/* pt_regs->bp, bx, r12-15 not saved */
 
 	/*
 	 * If we need to do entry work or if we guess we'll need to do
@@ -234,14 +234,14 @@  entry_SYSCALL_64_fastpath:
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	SAVE_EXTRA_REGS
-	movq	%rsp, %rdi
+	leaq	8(%rsp), %rdi
 	call	syscall_return_slowpath	/* returns with IRQs disabled */
 	jmp	return_from_SYSCALL_64
 
 entry_SYSCALL64_slow_path:
 	/* IRQs are off. */
 	SAVE_EXTRA_REGS
-	movq	%rsp, %rdi
+	leaq	8(%rsp), %rdi
 	call	do_syscall_64		/* returns with IRQs disabled */
 
 return_from_SYSCALL_64:
@@ -342,9 +342,9 @@  ENTRY(stub_ptregs_64)
 	 * Called from fast path -- disable IRQs again, pop return address
 	 * and jump to slow path
 	 */
+	popq	%rax
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
-	popq	%rax
 	jmp	entry_SYSCALL64_slow_path
 
 1:
@@ -409,13 +409,14 @@  END(__switch_to_asm)
  */
 ENTRY(ret_from_fork)
 	movq	%rax, %rdi
+	subq	$8, %rsp
 	call	schedule_tail			/* rdi: 'prev' task parameter */
 
 	testq	%rbx, %rbx			/* from kernel_thread? */
 	jnz	1f				/* kernel threads are uncommon */
 
 2:
-	movq	%rsp, %rdi
+	leaq	8(%rsp), %rdi
 	call	syscall_return_slowpath	/* returns with IRQs disabled */
 	TRACE_IRQS_ON			/* user mode is traced as IRQS on */
 	SWAPGS
@@ -494,10 +495,12 @@  END(irq_entries_start)
 	 * a little cheaper to use a separate counter in the PDA (short of
 	 * moving irq_enter into assembly, which would be too much work)
 	 */
-	movq	%rsp, %rdi
+	movq	%rsp, %rax
+	leaq	8(%rsp), %rdi
 	incl	PER_CPU_VAR(irq_count)
 	cmovzq	PER_CPU_VAR(irq_stack_ptr), %rsp
-	pushq	%rdi
+	sub	$8, %rsp
+	pushq	%rax
 	/* We entered an interrupt context - irqs are off: */
 	TRACE_IRQS_OFF
 
@@ -527,7 +530,7 @@  ret_from_intr:
 
 	/* Interrupt came from user space */
 GLOBAL(retint_user)
-	mov	%rsp,%rdi
+	leaq	8(%rsp), %rdi
 	call	prepare_exit_to_usermode
 	TRACE_IRQS_IRETQ
 	SWAPGS
@@ -774,7 +777,7 @@  ENTRY(\sym)
 	.endif
 	.endif
 
-	movq	%rsp, %rdi			/* pt_regs pointer */
+	leaq	8(%rsp), %rdi			/* pt_regs pointer */
 
 	.if \has_error_code
 	movq	ORIG_RAX(%rsp), %rsi		/* get error code */
@@ -810,11 +813,11 @@  ENTRY(\sym)
 	call	error_entry
 
 
-	movq	%rsp, %rdi			/* pt_regs pointer */
+	leaq	8(%rsp), %rdi			/* pt_regs pointer */
 	call	sync_regs
-	movq	%rax, %rsp			/* switch stack */
+	leaq	-8(%rax), %rsp			/* switch stack */
 
-	movq	%rsp, %rdi			/* pt_regs pointer */
+	movq	%rax, %rdi			/* pt_regs pointer */
 
 	.if \has_error_code
 	movq	ORIG_RAX(%rsp), %rsi		/* get error code */
@@ -895,6 +898,7 @@  ENTRY(do_softirq_own_stack)
 	mov	%rsp, %rbp
 	incl	PER_CPU_VAR(irq_count)
 	cmove	PER_CPU_VAR(irq_stack_ptr), %rsp
+	sub	$8, %rsp
 	push	%rbp				/* frame pointer backlink */
 	call	__do_softirq
 	leaveq
@@ -924,10 +928,11 @@  ENTRY(xen_do_hypervisor_callback)		/* do_hypervisor_callback(struct *pt_regs) */
  * Since we don't modify %rdi, evtchn_do_upall(struct *pt_regs) will
  * see the correct pointer to the pt_regs
  */
-	movq	%rdi, %rsp			/* we don't return, adjust the stack frame */
+	leaq	-8(%rdi), %rsp			/* we don't return, adjust the stack frame */
 11:	incl	PER_CPU_VAR(irq_count)
 	movq	%rsp, %rbp
 	cmovzq	PER_CPU_VAR(irq_stack_ptr), %rsp
+	subq	$8, %rsp
 	pushq	%rbp				/* frame pointer backlink */
 	call	xen_evtchn_do_upcall
 	popq	%rsp
@@ -1264,6 +1269,7 @@  ENTRY(nmi)
 	 */
 
 	movq	%rsp, %rdi
+	subq	$8, %rsp
 	movq	$-1, %rsi
 	call	do_nmi
 
@@ -1475,7 +1481,7 @@  end_repeat_nmi:
 	call	paranoid_entry
 
 	/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
-	movq	%rsp, %rdi
+	leaq	8(%rsp), %rdi
 	movq	$-1, %rsi
 	call	do_nmi
 
@@ -1519,7 +1525,7 @@  ENTRY(rewind_stack_do_exit)
 	xorl	%ebp, %ebp
 
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rax
-	leaq	-TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%rax), %rsp
+	leaq	-TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE-8(%rax), %rsp
 
 	call	do_exit
 1:	jmp 1b
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index e1721da..7d3f1e3 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -89,6 +89,8 @@  ENTRY(entry_SYSENTER_compat)
 	pushq   $0			/* pt_regs->r13 = 0 */
 	pushq   $0			/* pt_regs->r14 = 0 */
 	pushq   $0			/* pt_regs->r15 = 0 */
+
+	subq	$8, %rsp
 	cld
 
 	/*
@@ -120,7 +122,7 @@  ENTRY(entry_SYSENTER_compat)
 	 */
 	TRACE_IRQS_OFF
 
-	movq	%rsp, %rdi
+	leaq	8(%rsp), %rdi
 	call	do_fast_syscall_32
 	/* XEN PV guests always use IRET path */
 	ALTERNATIVE "testl %eax, %eax; jz .Lsyscall_32_done", \
@@ -215,13 +217,15 @@  ENTRY(entry_SYSCALL_compat)
 	pushq   $0			/* pt_regs->r14 = 0 */
 	pushq   $0			/* pt_regs->r15 = 0 */
 
+	subq	$8, %rsp
+
 	/*
 	 * User mode is traced as though IRQs are on, and SYSENTER
 	 * turned them off.
 	 */
 	TRACE_IRQS_OFF
 
-	movq	%rsp, %rdi
+	leaq	8(%rsp), %rdi
 	call	do_fast_syscall_32
 	/* XEN PV guests always use IRET path */
 	ALTERNATIVE "testl %eax, %eax; jz .Lsyscall_32_done", \
@@ -324,6 +328,8 @@  ENTRY(entry_INT80_compat)
 	pushq   %r13                    /* pt_regs->r13 */
 	pushq   %r14                    /* pt_regs->r14 */
 	pushq   %r15                    /* pt_regs->r15 */
+
+	subq	$8, %rsp
 	cld
 
 	/*
@@ -332,7 +338,7 @@  ENTRY(entry_INT80_compat)
 	 */
 	TRACE_IRQS_OFF
 
-	movq	%rsp, %rdi
+	leaq	8(%rsp), %rdi
 	call	do_int80_syscall_32
 .Lsyscall_32_done:
 
diff --git a/arch/x86/entry/thunk_64.S b/arch/x86/entry/thunk_64.S
index be36bf4..3c80aac 100644
--- a/arch/x86/entry/thunk_64.S
+++ b/arch/x86/entry/thunk_64.S
@@ -33,6 +33,7 @@ 
 	movq 8(%rbp), %rdi
 	.endif
 
+	sub $8, %rsp
 	call \func
 	jmp  .L_restore
 	_ASM_NOKPROBE(\name)
@@ -58,6 +59,7 @@ 
  || defined(CONFIG_DEBUG_LOCK_ALLOC) \
  || defined(CONFIG_PREEMPT)
 .L_restore:
+	add $8, %rsp
 	popq %r11
 	popq %r10
 	popq %r9
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index b467b14..d03ab72 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -384,6 +384,8 @@  early_idt_handler_common:
 	pushq %r14				/* pt_regs->r14 */
 	pushq %r15				/* pt_regs->r15 */
 
+	sub $8, %rsp
+
 	cmpq $14,%rsi		/* Page fault? */
 	jnz 10f
 	GET_CR2_INTO(%rdi)	/* Can clobber any volatile register if pv */
@@ -392,7 +394,7 @@  early_idt_handler_common:
 	jz 20f			/* All good */
 
 10:
-	movq %rsp,%rdi		/* RDI = pt_regs; RSI is already trapnr */
+	leaq 8(%rsp), %rdi	/* RDI = pt_regs; RSI is already trapnr */
 	call early_fixup_exception
 
 20:
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index bf0c6d0..2af9f81 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -590,6 +590,7 @@  asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
 
 struct bad_iret_stack {
 	void *error_entry_ret;
+	void *padding;
 	struct pt_regs regs;
 };