diff mbox series

[RFC,v3,08/31] KVM: selftests: Require GCC to realign stacks on function entry

Message ID 20230121001542.2472357-9-ackerleytng@google.com (mailing list archive)
State New
Headers show
Series TDX KVM selftests | expand

Commit Message

Ackerley Tng Jan. 21, 2023, 12:15 a.m. UTC
Some SSE instructions assume a 16-byte aligned stack, and GCC compiles
assuming the stack is aligned:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838. This combination
results in a #GP in guests.

Adding this compiler flag will generate an alternate prologue and
epilogue to realign the runtime stack, which makes selftest code
slower and bigger, but this is okay since we do not need selftest code
to be extremely performant.

Similar issue discussed at
https://lore.kernel.org/all/CAGtprH9yKvuaF5yruh3BupQe4BxDGiBQk3ExtY2m39yP-tppsg@mail.gmail.com/

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 tools/testing/selftests/kvm/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Sean Christopherson Jan. 21, 2023, 12:27 a.m. UTC | #1
On Sat, Jan 21, 2023, Ackerley Tng wrote:
> Some SSE instructions assume a 16-byte aligned stack, and GCC compiles
> assuming the stack is aligned:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838. This combination
> results in a #GP in guests.
> 
> Adding this compiler flag will generate an alternate prologue and
> epilogue to realign the runtime stack, which makes selftest code
> slower and bigger, but this is okay since we do not need selftest code
> to be extremely performant.

Huh, I had completely forgotten that this is why SSE is problematic.  I ran into
this with the base UPM selftests and just disabled SSE.  /facepalm.

We should figure out exactly what is causing a misaligned stack.  As you've noted,
the x86-64 ABI requires a 16-byte aligned RSP.  Unless I'm misreading vm_arch_vcpu_add(),
the starting stack should be page aligned, which means something is causing the
stack to become unaligned at runtime.  I'd rather hunt down that something than
paper over it by having the compiler force realignment.

> Similar issue discussed at
> https://lore.kernel.org/all/CAGtprH9yKvuaF5yruh3BupQe4BxDGiBQk3ExtY2m39yP-tppsg@mail.gmail.com/
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  tools/testing/selftests/kvm/Makefile | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
> index 317927d9c55bd..5f9cc1e6ee67e 100644
> --- a/tools/testing/selftests/kvm/Makefile
> +++ b/tools/testing/selftests/kvm/Makefile
> @@ -205,7 +205,7 @@ LINUX_TOOL_ARCH_INCLUDE = $(top_srcdir)/tools/arch/x86/include
>  else
>  LINUX_TOOL_ARCH_INCLUDE = $(top_srcdir)/tools/arch/$(ARCH)/include
>  endif
> -CFLAGS += -Wall -Wstrict-prototypes -Wuninitialized -O2 -g -std=gnu99 \
> +CFLAGS += -mstackrealign -Wall -Wstrict-prototypes -Wuninitialized -O2 -g -std=gnu99 \
>  	-fno-stack-protector -fno-PIE -I$(LINUX_TOOL_INCLUDE) \
>  	-I$(LINUX_TOOL_ARCH_INCLUDE) -I$(LINUX_HDR_PATH) -Iinclude \
>  	-I$(<D) -Iinclude/$(UNAME_M) -I ../rseq -I.. $(EXTRA_CFLAGS) \
> -- 
> 2.39.0.246.g2a6d74b583-goog
>
Erdem Aktas Jan. 23, 2023, 6:30 p.m. UTC | #2
On Fri, Jan 20, 2023 at 4:28 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Sat, Jan 21, 2023, Ackerley Tng wrote:
> > Some SSE instructions assume a 16-byte aligned stack, and GCC compiles
> > assuming the stack is aligned:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838. This combination
> > results in a #GP in guests.
> >
> > Adding this compiler flag will generate an alternate prologue and
> > epilogue to realign the runtime stack, which makes selftest code
> > slower and bigger, but this is okay since we do not need selftest code
> > to be extremely performant.
>
> Huh, I had completely forgotten that this is why SSE is problematic.  I ran into
> this with the base UPM selftests and just disabled SSE.  /facepalm.
>
> We should figure out exactly what is causing a misaligned stack.  As you've noted,
> the x86-64 ABI requires a 16-byte aligned RSP.  Unless I'm misreading vm_arch_vcpu_add(),
> the starting stack should be page aligned, which means something is causing the
> stack to become unaligned at runtime.  I'd rather hunt down that something than
> paper over it by having the compiler force realignment.

Is not it due to the 32bit execution part of the guest code at boot
time. Any push/pop of 32bit registers might make it a 16-byte
unaligned stack.

>
> > Similar issue discussed at
> > https://lore.kernel.org/all/CAGtprH9yKvuaF5yruh3BupQe4BxDGiBQk3ExtY2m39yP-tppsg@mail.gmail.com/
> >
> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > ---
> >  tools/testing/selftests/kvm/Makefile | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
> > index 317927d9c55bd..5f9cc1e6ee67e 100644
> > --- a/tools/testing/selftests/kvm/Makefile
> > +++ b/tools/testing/selftests/kvm/Makefile
> > @@ -205,7 +205,7 @@ LINUX_TOOL_ARCH_INCLUDE = $(top_srcdir)/tools/arch/x86/include
> >  else
> >  LINUX_TOOL_ARCH_INCLUDE = $(top_srcdir)/tools/arch/$(ARCH)/include
> >  endif
> > -CFLAGS += -Wall -Wstrict-prototypes -Wuninitialized -O2 -g -std=gnu99 \
> > +CFLAGS += -mstackrealign -Wall -Wstrict-prototypes -Wuninitialized -O2 -g -std=gnu99 \
> >       -fno-stack-protector -fno-PIE -I$(LINUX_TOOL_INCLUDE) \
> >       -I$(LINUX_TOOL_ARCH_INCLUDE) -I$(LINUX_HDR_PATH) -Iinclude \
> >       -I$(<D) -Iinclude/$(UNAME_M) -I ../rseq -I.. $(EXTRA_CFLAGS) \
> > --
> > 2.39.0.246.g2a6d74b583-goog
> >
Maciej S. Szmigiero Jan. 23, 2023, 6:50 p.m. UTC | #3
On 23.01.2023 19:30, Erdem Aktas wrote:
> On Fri, Jan 20, 2023 at 4:28 PM Sean Christopherson <seanjc@google.com> wrote:
>>
>> On Sat, Jan 21, 2023, Ackerley Tng wrote:
>>> Some SSE instructions assume a 16-byte aligned stack, and GCC compiles
>>> assuming the stack is aligned:
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838. This combination
>>> results in a #GP in guests.
>>>
>>> Adding this compiler flag will generate an alternate prologue and
>>> epilogue to realign the runtime stack, which makes selftest code
>>> slower and bigger, but this is okay since we do not need selftest code
>>> to be extremely performant.
>>
>> Huh, I had completely forgotten that this is why SSE is problematic.  I ran into
>> this with the base UPM selftests and just disabled SSE.  /facepalm.
>>
>> We should figure out exactly what is causing a misaligned stack.  As you've noted,
>> the x86-64 ABI requires a 16-byte aligned RSP.  Unless I'm misreading vm_arch_vcpu_add(),
>> the starting stack should be page aligned, which means something is causing the
>> stack to become unaligned at runtime.  I'd rather hunt down that something than
>> paper over it by having the compiler force realignment.
> 
> Is not it due to the 32bit execution part of the guest code at boot
> time. Any push/pop of 32bit registers might make it a 16-byte
> unaligned stack.

32-bit stack needs to be 16-byte aligned, too (at function call boundaries) -
see [1] chapter 2.2.2 "The Stack Frame"

Thanks,
Maciej

[1]: https://www.uclibc.org/docs/psABI-i386.pdf
Sean Christopherson Jan. 23, 2023, 6:53 p.m. UTC | #4
On Mon, Jan 23, 2023, Maciej S. Szmigiero wrote:
> On 23.01.2023 19:30, Erdem Aktas wrote:
> > On Fri, Jan 20, 2023 at 4:28 PM Sean Christopherson <seanjc@google.com> wrote:
> > > 
> > > On Sat, Jan 21, 2023, Ackerley Tng wrote:
> > > > Some SSE instructions assume a 16-byte aligned stack, and GCC compiles
> > > > assuming the stack is aligned:
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838. This combination
> > > > results in a #GP in guests.
> > > > 
> > > > Adding this compiler flag will generate an alternate prologue and
> > > > epilogue to realign the runtime stack, which makes selftest code
> > > > slower and bigger, but this is okay since we do not need selftest code
> > > > to be extremely performant.
> > > 
> > > Huh, I had completely forgotten that this is why SSE is problematic.  I ran into
> > > this with the base UPM selftests and just disabled SSE.  /facepalm.
> > > 
> > > We should figure out exactly what is causing a misaligned stack.  As you've noted,
> > > the x86-64 ABI requires a 16-byte aligned RSP.  Unless I'm misreading vm_arch_vcpu_add(),
> > > the starting stack should be page aligned, which means something is causing the
> > > stack to become unaligned at runtime.  I'd rather hunt down that something than
> > > paper over it by having the compiler force realignment.
> > 
> > Is not it due to the 32bit execution part of the guest code at boot
> > time. Any push/pop of 32bit registers might make it a 16-byte
> > unaligned stack.
> 
> 32-bit stack needs to be 16-byte aligned, too (at function call boundaries) -
> see [1] chapter 2.2.2 "The Stack Frame"

And this showing up in the non-TDX selftests rules that out as the sole problem;
the selftests stuff 64-bit mode, i.e. don't have 32-bit boot code.
Erdem Aktas Jan. 24, 2023, 12:04 a.m. UTC | #5
On Mon, Jan 23, 2023 at 10:53 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Jan 23, 2023, Maciej S. Szmigiero wrote:
> > On 23.01.2023 19:30, Erdem Aktas wrote:
> > > On Fri, Jan 20, 2023 at 4:28 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > On Sat, Jan 21, 2023, Ackerley Tng wrote:
> > > > > Some SSE instructions assume a 16-byte aligned stack, and GCC compiles
> > > > > assuming the stack is aligned:
> > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838. This combination
> > > > > results in a #GP in guests.
> > > > >
> > > > > Adding this compiler flag will generate an alternate prologue and
> > > > > epilogue to realign the runtime stack, which makes selftest code
> > > > > slower and bigger, but this is okay since we do not need selftest code
> > > > > to be extremely performant.
> > > >
> > > > Huh, I had completely forgotten that this is why SSE is problematic.  I ran into
> > > > this with the base UPM selftests and just disabled SSE.  /facepalm.
> > > >
> > > > We should figure out exactly what is causing a misaligned stack.  As you've noted,
> > > > the x86-64 ABI requires a 16-byte aligned RSP.  Unless I'm misreading vm_arch_vcpu_add(),
> > > > the starting stack should be page aligned, which means something is causing the
> > > > stack to become unaligned at runtime.  I'd rather hunt down that something than
> > > > paper over it by having the compiler force realignment.
> > >
> > > Is not it due to the 32bit execution part of the guest code at boot
> > > time. Any push/pop of 32bit registers might make it a 16-byte
> > > unaligned stack.
> >
> > 32-bit stack needs to be 16-byte aligned, too (at function call boundaries) -
> > see [1] chapter 2.2.2 "The Stack Frame"
>
> And this showing up in the non-TDX selftests rules that out as the sole problem;
> the selftests stuff 64-bit mode, i.e. don't have 32-bit boot code.

Thanks Maciej and Sean for the clarification. I was suspecting the
hand-coded assembly part that we have for TDX tests but  it being
happening in the non-TDX selftests disproves it.
Sean Christopherson Jan. 24, 2023, 1:21 a.m. UTC | #6
On Mon, Jan 23, 2023, Erdem Aktas wrote:
> On Mon, Jan 23, 2023 at 10:53 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Jan 23, 2023, Maciej S. Szmigiero wrote:
> > > On 23.01.2023 19:30, Erdem Aktas wrote:
> > > > On Fri, Jan 20, 2023 at 4:28 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > >
> > > > > On Sat, Jan 21, 2023, Ackerley Tng wrote:
> > > > > > Some SSE instructions assume a 16-byte aligned stack, and GCC compiles
> > > > > > assuming the stack is aligned:
> > > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838. This combination
> > > > > > results in a #GP in guests.
> > > > > >
> > > > > > Adding this compiler flag will generate an alternate prologue and
> > > > > > epilogue to realign the runtime stack, which makes selftest code
> > > > > > slower and bigger, but this is okay since we do not need selftest code
> > > > > > to be extremely performant.
> > > > >
> > > > > Huh, I had completely forgotten that this is why SSE is problematic.  I ran into
> > > > > this with the base UPM selftests and just disabled SSE.  /facepalm.
> > > > >
> > > > > We should figure out exactly what is causing a misaligned stack.  As you've noted,
> > > > > the x86-64 ABI requires a 16-byte aligned RSP.  Unless I'm misreading vm_arch_vcpu_add(),
> > > > > the starting stack should be page aligned, which means something is causing the
> > > > > stack to become unaligned at runtime.  I'd rather hunt down that something than
> > > > > paper over it by having the compiler force realignment.
> > > >
> > > > Is not it due to the 32bit execution part of the guest code at boot
> > > > time. Any push/pop of 32bit registers might make it a 16-byte
> > > > unaligned stack.
> > >
> > > 32-bit stack needs to be 16-byte aligned, too (at function call boundaries) -
> > > see [1] chapter 2.2.2 "The Stack Frame"
> >
> > And this showing up in the non-TDX selftests rules that out as the sole problem;
> > the selftests stuff 64-bit mode, i.e. don't have 32-bit boot code.
> 
> Thanks Maciej and Sean for the clarification. I was suspecting the
> hand-coded assembly part that we have for TDX tests but  it being
> happening in the non-TDX selftests disproves it.

Not necessarily, it could be both.  Goofs in the handcoded assembly and PEBKAC
on my end :-)
Sean Christopherson Feb. 15, 2023, 10:24 p.m. UTC | #7
On Wed, Feb 15, 2023, Ackerley Tng wrote:
> I figured it out!
> 
> GCC assumes that the stack is 16-byte aligned **before** the call
> instruction. Since call pushes rip to the stack, GCC will compile code
> assuming that on entrance to the function, the stack is -8 from a
> 16-byte aligned address.
> 
> Since for TDs we do a ljmp to guest code, providing a function's
> address, the stack was not modified by a call instruction pushing rip to
> the stack, so the stack is 16-byte aligned when the guest code starts
> running, instead of 16-byte aligned -8 that GCC expects.
> 
> For VMs, we set rip to a function pointer, and the VM starts running
> with a 16-byte algined stack too.
> 
> To fix this, I propose that in vm_arch_vcpu_add(), we align the
> allocated stack address and then subtract 8 from that:
> 
> @@ -573,10 +573,13 @@ struct kvm_vcpu *vm_arch_vcpu_add(struct kvm_vm *vm,
> uint32_t vcpu_id,
>         vcpu_init_cpuid(vcpu, kvm_get_supported_cpuid());
>         vcpu_setup(vm, vcpu);
> 
> +       stack_vaddr += (DEFAULT_STACK_PGS * getpagesize());
> +       stack_vaddr = ALIGN_DOWN(stack_vaddr, 16) - 8;

The ALIGN_DOWN should be unnecessary, we've got larger issues if getpagesize() isn't
16-byte aligned and/or if __vm_vaddr_alloc() returns anything but a page-aligned
address.  Maybe add a TEST_ASSERT() sanity check that stack_vaddr is page-aligned
at this point?

And in addition to the comment suggested by Maciej, can you also add a comment
explaining the -8 adjust?  Yeah, someone can go read the changelog, but I think
this is worth explicitly documenting in code.

Lastly, can you post it as a standalone patch?

Many thanks!
diff mbox series

Patch

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 317927d9c55bd..5f9cc1e6ee67e 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -205,7 +205,7 @@  LINUX_TOOL_ARCH_INCLUDE = $(top_srcdir)/tools/arch/x86/include
 else
 LINUX_TOOL_ARCH_INCLUDE = $(top_srcdir)/tools/arch/$(ARCH)/include
 endif
-CFLAGS += -Wall -Wstrict-prototypes -Wuninitialized -O2 -g -std=gnu99 \
+CFLAGS += -mstackrealign -Wall -Wstrict-prototypes -Wuninitialized -O2 -g -std=gnu99 \
 	-fno-stack-protector -fno-PIE -I$(LINUX_TOOL_INCLUDE) \
 	-I$(LINUX_TOOL_ARCH_INCLUDE) -I$(LINUX_HDR_PATH) -Iinclude \
 	-I$(<D) -Iinclude/$(UNAME_M) -I ../rseq -I.. $(EXTRA_CFLAGS) \