[V2,07/17] x86/entry/64: Remove redundant interrupt disable

Message ID	20191023123118.296135499@linutronix.de (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=2s9g=YQ=vger.kernel.org=kvm-owner@kernel.org> Message-Id: <20191023123118.296135499@linutronix.de> User-Agent: quilt/0.65 Date: Wed, 23 Oct 2019 14:27:12 +0200 From: Thomas Gleixner <tglx@linutronix.de> To: LKML <linux-kernel@vger.kernel.org> Cc: x86@kernel.org, Peter Zijlstra <peterz@infradead.org>, Andy Lutomirski <luto@kernel.org>, Will Deacon <will@kernel.org>, Paolo Bonzini <pbonzini@redhat.com>, kvm@vger.kernel.org, linux-arch@vger.kernel.org, Mike Rapoport <rppt@linux.ibm.com>, Josh Poimboeuf <jpoimboe@redhat.com>, Miroslav Benes <mbenes@suse.cz> Subject: [patch V2 07/17] x86/entry/64: Remove redundant interrupt disable References: <20191023122705.198339581@linutronix.de> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Sender: kvm-owner@vger.kernel.org Precedence: bulk
Series	entry: Provide generic implementation for host and guest entry/exit work \| expand [V2,00/17] entry: Provide generic implementation for host and guest entry/exit work [V2,01/17] x86/entry/32: Remove unused resume_userspace label [V2,02/17] x86/entry/64: Remove pointless jump in paranoid_exit [V2,03/17] x86/traps: Remove pointless irq enable from do_spurious_interrupt_bug() [V2,04/17] x86/entry: Make DEBUG_ENTRY_ASSERT_IRQS_OFF available for 32bit [V2,05/17] x86/traps: Make interrupt enable/disable symmetric in C code [V2,06/17] x86/entry/32: Remove redundant interrupt disable [V2,07/17] x86/entry/64: Remove redundant interrupt disable [V2,08/17] x86/entry: Move syscall irq tracing to C code [V2,09/17] x86/entry: Remove _TIF_NOHZ from _TIF_WORK_SYSCALL_ENTRY [V2,10/17] entry: Provide generic syscall entry functionality [V2,11/17] x86/entry: Use generic syscall entry function [V2,12/17] entry: Provide generic syscall exit function [V2,13/17] x86/entry: Use generic syscall exit functionality [V2,14/17] entry: Provide generic exit to usermode functionality [V2,15/17] x86/entry: Use generic exit to usermode [V2,16/17] kvm/workpending: Provide infrastructure for work before entering a guest [V2,17/17] x86/kvm: Use generic exit to guest work function

Thomas Gleixner Oct. 23, 2019, 12:27 p.m. UTC

Now that the trap handlers return with interrupts disabled, the
unconditional disabling of interrupts in the low level entry code can be
removed along with the trace calls.

Add debug checks where appropriate.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/entry_64.S |    9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

Sean Christopherson Oct. 23, 2019, 2:20 p.m. UTC | #1

On Wed, Oct 23, 2019 at 02:27:12PM +0200, Thomas Gleixner wrote:
> Now that the trap handlers return with interrupts disabled, the
> unconditional disabling of interrupts in the low level entry code can be
> removed along with the trace calls.
> 
> Add debug checks where appropriate.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---

Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>

Josh Poimboeuf Oct. 23, 2019, 10:06 p.m. UTC | #2

On Wed, Oct 23, 2019 at 02:27:12PM +0200, Thomas Gleixner wrote:
> Now that the trap handlers return with interrupts disabled, the
> unconditional disabling of interrupts in the low level entry code can be
> removed along with the trace calls.
> 
> Add debug checks where appropriate.

This seems a little scary.  Does anybody other than Andy actually run
with CONFIG_DEBUG_ENTRY?  What happens if somebody accidentally leaves
irqs enabled?  How do we know you found all the leaks?

Thomas Gleixner Oct. 23, 2019, 11:52 p.m. UTC | #3

On Wed, 23 Oct 2019, Josh Poimboeuf wrote:

> On Wed, Oct 23, 2019 at 02:27:12PM +0200, Thomas Gleixner wrote:
> > Now that the trap handlers return with interrupts disabled, the
> > unconditional disabling of interrupts in the low level entry code can be
> > removed along with the trace calls.
> > 
> > Add debug checks where appropriate.
> 
> This seems a little scary.  Does anybody other than Andy actually run
> with CONFIG_DEBUG_ENTRY?

I do.

> What happens if somebody accidentally leaves irqs enabled?  How do we
> know you found all the leaks?

For the DO_ERROR() ones that's trivial:

 #define DO_ERROR(trapnr, signr, sicode, addr, str, name)                  \
 dotraplinkage void do_##name(struct pt_regs *regs, long error_code)	   \
 {									   \
 	do_error_trap(regs, error_code, str, trapnr, signr, sicode, addr); \
+	lockdep_assert_irqs_disabled();					   \
 }
 
 DO_ERROR(X86_TRAP_DE,     SIGFPE,  FPE_INTDIV,   IP, "divide error",        divide_error)

Now for the rest we surely could do:

dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
{
	__do_bounds(regs, error_code);
	lockdep_assert_irqs_disabled();
}

and move the existing body into a static function so independent of any
(future) return path there the lockdep assert will be invoked.

Thanks,

	tglx

Andy Lutomirski Oct. 24, 2019, 4:18 p.m. UTC | #4

On Wed, Oct 23, 2019 at 4:52 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Wed, 23 Oct 2019, Josh Poimboeuf wrote:
>
> > On Wed, Oct 23, 2019 at 02:27:12PM +0200, Thomas Gleixner wrote:
> > > Now that the trap handlers return with interrupts disabled, the
> > > unconditional disabling of interrupts in the low level entry code can be
> > > removed along with the trace calls.
> > >
> > > Add debug checks where appropriate.
> >
> > This seems a little scary.  Does anybody other than Andy actually run
> > with CONFIG_DEBUG_ENTRY?
>
> I do.
>
> > What happens if somebody accidentally leaves irqs enabled?  How do we
> > know you found all the leaks?
>
> For the DO_ERROR() ones that's trivial:
>
>  #define DO_ERROR(trapnr, signr, sicode, addr, str, name)                  \
>  dotraplinkage void do_##name(struct pt_regs *regs, long error_code)       \
>  {                                                                         \
>         do_error_trap(regs, error_code, str, trapnr, signr, sicode, addr); \
> +       lockdep_assert_irqs_disabled();                                    \
>  }
>
>  DO_ERROR(X86_TRAP_DE,     SIGFPE,  FPE_INTDIV,   IP, "divide error",        divide_error)
>
> Now for the rest we surely could do:
>
> dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
> {
>         __do_bounds(regs, error_code);
>         lockdep_assert_irqs_disabled();
> }
>
> and move the existing body into a static function so independent of any
> (future) return path there the lockdep assert will be invoked.
>

If we do this, can we macro-ize it:

DEFINE_IDTENTRY_HANDLER(do_bounds)
{
 ...
}

If you do this, please don't worry about the weird ones that take cr2
as a third argument.  Once your series lands, I will send a follow-up
to get rid of it.  It's 2/3 written already.

Thomas Gleixner Oct. 24, 2019, 8:52 p.m. UTC | #5

On Thu, 24 Oct 2019, Andy Lutomirski wrote:
> On Wed, Oct 23, 2019 at 4:52 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Wed, 23 Oct 2019, Josh Poimboeuf wrote:
> > > What happens if somebody accidentally leaves irqs enabled?  How do we
> > > know you found all the leaks?
> >
> > For the DO_ERROR() ones that's trivial:
> >
> >  #define DO_ERROR(trapnr, signr, sicode, addr, str, name)                  \
> >  dotraplinkage void do_##name(struct pt_regs *regs, long error_code)       \
> >  {                                                                         \
> >         do_error_trap(regs, error_code, str, trapnr, signr, sicode, addr); \
> > +       lockdep_assert_irqs_disabled();                                    \
> >  }
> >
> >  DO_ERROR(X86_TRAP_DE,     SIGFPE,  FPE_INTDIV,   IP, "divide error",        divide_error)
> >
> > Now for the rest we surely could do:
> >
> > dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
> > {
> >         __do_bounds(regs, error_code);
> >         lockdep_assert_irqs_disabled();
> > }
> >
> > and move the existing body into a static function so independent of any
> > (future) return path there the lockdep assert will be invoked.
> >
> 
> If we do this, can we macro-ize it:
> 
> DEFINE_IDTENTRY_HANDLER(do_bounds)
> {
>  ...
> }
>  
> If you do this, please don't worry about the weird ones that take cr2
> as a third argument.  Once your series lands, I will send a follow-up
> to get rid of it.  It's 2/3 written already.

I spent quite some time digging deeper into this. Finding all corner cases
which eventually enable interrupts from an exception handler is not as
trivial as it looked in the first place. Especially the fault handler is a
nightmare. Also PeterZ's approach of doing

	   if (regs->eflags & IF)
	   	local_irq_disable();

is doomed due to sys_iopl(). See below.

I'm tempted to do pretty much the same thing as the syscall rework did
as a first step:

  - Move the actual handler invocation to C

  - Do the irq tracing on entry in C

  - Move irq disable before return to ASM

Peter gave me some half finished patches which pretty much do that by
copying half of the linux/syscalls.h macro maze into the entry code. That's
one possible solution, but TBH it sucks big times.

We have the following variants:

do_divide_error(struct pt_regs *regs, long error_code);
do_debug(struct pt_regs *regs, long error_code);
do_nmi(struct pt_regs *regs, long error_code);
do_int3(struct pt_regs *regs, long error_code);
do_overflow(struct pt_regs *regs, long error_code);
do_bounds(struct pt_regs *regs, long error_code);
do_invalid_op(struct pt_regs *regs, long error_code);
do_device_not_available(struct pt_regs *regs, long error_code);
do_coprocessor_segment_overrun(struct pt_regs *regs, long error_code);
do_invalid_TSS(struct pt_regs *regs, long error_code);
do_segment_not_present(struct pt_regs *regs, long error_code);
do_stack_segment(struct pt_regs *regs, long error_code);
do_general_protection(struct pt_regs *regs, long error_code);
do_spurious_interrupt_bug(struct pt_regs *regs, long error_code);
do_coprocessor_error(struct pt_regs *regs, long error_code);
do_alignment_check(struct pt_regs *regs, long error_code);
do_machine_check(struct pt_regs *regs, long error_code);
do_simd_coprocessor_error(struct pt_regs *regs, long error_code);
do_iret_error(struct pt_regs *regs, long error_code);
do_mce(struct pt_regs *regs, long error_code);

do_async_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address);
do_double_fault(struct pt_regs *regs, long error_code, unsigned long address);
do_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address);

So if we can remove the third argument then we can spare most of the macro
maze and just have one common function without bells and whistels. The
other option would be to extend all handlers to have three arguments,
i.e. add 'long unused', which is not pretty either.

What's your plan with cr2? Stash it in pt_regs or something else?

Once we have the interesting parts in C then we can revisit the elimination
of the unconditional irq disable because in C it's way simpler to do
diagnostics, but I'm not entirely sure whether it's worth it.

A related issue is the inconsistency of the irq disabled tracing in the
return to user path. As I pointed out in the other mail, the various
syscall implementations do that differently. The exception handlers do it
always conditional, regular interrupts as well. For regular interrupts that
does not make sense as they can by all means never return to an interrupt
disabled context.

The interesting bells and whistels result from sys_iopl(). If user space
has been granted iopl(level = 3) it gains cli/sti priviledges. When the
application has interrupts disabled in userspace:

  - invocation of a syscall

  - any exception (aside of NMI/MCE) which conditionally enables interrupts
    depending on user_mode(regs) and therefor can be preempted and
    schedule

is just undefined behaviour and I personally consider it to be a plain bug.

Just for the record: This results in running a resulting or even completely
unrelated signal handler with interrupts disabled as well.

Whatever we decide it is, leaving it completely inconsistent is not a
solution at all. The options are:

  1)  Always do conditional tracing depending on the user_regs->eflags.IF
      state.

  2)  #1 + warn once when syscalls and exceptions (except NMI/MCE) happen
      and user_regs->eflags.IF is cleared.

  3a) #2 + enforce signal handling to run with interrupts enabled.

  3b) #2 + set regs->eflags.IF. So the state is always correct from the
      kernel POV. Of course that changes existing behaviour, but its
      changing undefined and inconsistent behaviour.

  4) Let iopl(level) return -EPERM if level == 3.

     Yeah, I know it's not possible due to regressions (DPKD uses iopl(3)),
     but TBH that'd be the sanest option of all.

     Of course the infinite wisdom of hardware designers tied IN, INS, OUT,
     OUTS and CLI/STI together on IOPL so we cannot even distangle them in
     any way.

     The only way out would be to actually use a full 8K sized I/O bitmap,
     but that's a massive pain as it has to be copied on every context
     switch. 

Really pretty options to chose from ...

Thanks,

	tglx

Thomas Gleixner Oct. 24, 2019, 8:59 p.m. UTC | #6

On Thu, 24 Oct 2019, Thomas Gleixner wrote:
> Whatever we decide it is, leaving it completely inconsistent is not a
> solution at all. The options are:

Actually there is also:

    0) Always do unconditional trace_irqs_on().

       But that does not allow to actually trace the real return flags
       state which might be useful to diagnose crap which results from user
       space CLI.
 
>   1)  Always do conditional tracing depending on the user_regs->eflags.IF
>       state.
> 
>   2)  #1 + warn once when syscalls and exceptions (except NMI/MCE) happen
>       and user_regs->eflags.IF is cleared.
> 
>   3a) #2 + enforce signal handling to run with interrupts enabled.
> 
>   3b) #2 + set regs->eflags.IF. So the state is always correct from the
>       kernel POV. Of course that changes existing behaviour, but its
>       changing undefined and inconsistent behaviour.
>   
>   4) Let iopl(level) return -EPERM if level == 3.
> 
>      Yeah, I know it's not possible due to regressions (DPKD uses iopl(3)),
>      but TBH that'd be the sanest option of all.
> 
>      Of course the infinite wisdom of hardware designers tied IN, INS, OUT,
>      OUTS and CLI/STI together on IOPL so we cannot even distangle them in
>      any way.
> 
>      The only way out would be to actually use a full 8K sized I/O bitmap,
>      but that's a massive pain as it has to be copied on every context
>      switch. 
> 
> Really pretty options to chose from ...
> 
> Thanks,
> 
> 	tglx
>

Peter Zijlstra Oct. 24, 2019, 9:21 p.m. UTC | #7

On Thu, Oct 24, 2019 at 10:52:59PM +0200, Thomas Gleixner wrote:
> is just undefined behaviour and I personally consider it to be a plain bug.

I concur.

> Just for the record: This results in running a resulting or even completely
> unrelated signal handler with interrupts disabled as well.
> 
> Whatever we decide it is, leaving it completely inconsistent is not a
> solution at all. The options are:
> 
>   1)  Always do conditional tracing depending on the user_regs->eflags.IF
>       state.
> 
>   2)  #1 + warn once when syscalls and exceptions (except NMI/MCE) happen
>       and user_regs->eflags.IF is cleared.
> 
>   3a) #2 + enforce signal handling to run with interrupts enabled.
> 
>   3b) #2 + set regs->eflags.IF. So the state is always correct from the
>       kernel POV. Of course that changes existing behaviour, but its
>       changing undefined and inconsistent behaviour.
>   
>   4) Let iopl(level) return -EPERM if level == 3.
> 
>      Yeah, I know it's not possible due to regressions (DPKD uses iopl(3)),
>      but TBH that'd be the sanest option of all.
> 
>      Of course the infinite wisdom of hardware designers tied IN, INS, OUT,
>      OUTS and CLI/STI together on IOPL so we cannot even distangle them in
>      any way.
> 
>      The only way out would be to actually use a full 8K sized I/O bitmap,
>      but that's a massive pain as it has to be copied on every context
>      switch. 
> 
> Really pretty options to chose from ...

If 4 is out (and I'm afraid it might be), then I'm on record for liking
3b.

Andy Lutomirski Oct. 24, 2019, 9:24 p.m. UTC | #8

On Thu, Oct 24, 2019 at 1:53 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Thu, 24 Oct 2019, Andy Lutomirski wrote:
> > On Wed, Oct 23, 2019 at 4:52 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > On Wed, 23 Oct 2019, Josh Poimboeuf wrote:
> > > > What happens if somebody accidentally leaves irqs enabled?  How do we
> > > > know you found all the leaks?
> > >
> > > For the DO_ERROR() ones that's trivial:
> > >
> > >  #define DO_ERROR(trapnr, signr, sicode, addr, str, name)                  \
> > >  dotraplinkage void do_##name(struct pt_regs *regs, long error_code)       \
> > >  {                                                                         \
> > >         do_error_trap(regs, error_code, str, trapnr, signr, sicode, addr); \
> > > +       lockdep_assert_irqs_disabled();                                    \
> > >  }
> > >
> > >  DO_ERROR(X86_TRAP_DE,     SIGFPE,  FPE_INTDIV,   IP, "divide error",        divide_error)
> > >
> > > Now for the rest we surely could do:
> > >
> > > dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
> > > {
> > >         __do_bounds(regs, error_code);
> > >         lockdep_assert_irqs_disabled();
> > > }
> > >
> > > and move the existing body into a static function so independent of any
> > > (future) return path there the lockdep assert will be invoked.
> > >
> >
> > If we do this, can we macro-ize it:
> >
> > DEFINE_IDTENTRY_HANDLER(do_bounds)
> > {
> >  ...
> > }
> >
> > If you do this, please don't worry about the weird ones that take cr2
> > as a third argument.  Once your series lands, I will send a follow-up
> > to get rid of it.  It's 2/3 written already.
>
> I spent quite some time digging deeper into this. Finding all corner cases
> which eventually enable interrupts from an exception handler is not as
> trivial as it looked in the first place. Especially the fault handler is a
> nightmare. Also PeterZ's approach of doing
>
>            if (regs->eflags & IF)
>                 local_irq_disable();
>
> is doomed due to sys_iopl(). See below.

I missed something in the discussion.  What breaks?  Can you check
user_mode(regs) too?

>
> I'm tempted to do pretty much the same thing as the syscall rework did
> as a first step:
>
>   - Move the actual handler invocation to C
>
>   - Do the irq tracing on entry in C
>
>   - Move irq disable before return to ASM
>
> Peter gave me some half finished patches which pretty much do that by
> copying half of the linux/syscalls.h macro maze into the entry code. That's
> one possible solution, but TBH it sucks big times.
>
> We have the following variants:
>
> do_divide_error(struct pt_regs *regs, long error_code);
> do_debug(struct pt_regs *regs, long error_code);
> do_nmi(struct pt_regs *regs, long error_code);
> do_int3(struct pt_regs *regs, long error_code);
> do_overflow(struct pt_regs *regs, long error_code);
> do_bounds(struct pt_regs *regs, long error_code);
> do_invalid_op(struct pt_regs *regs, long error_code);
> do_device_not_available(struct pt_regs *regs, long error_code);
> do_coprocessor_segment_overrun(struct pt_regs *regs, long error_code);
> do_invalid_TSS(struct pt_regs *regs, long error_code);
> do_segment_not_present(struct pt_regs *regs, long error_code);
> do_stack_segment(struct pt_regs *regs, long error_code);
> do_general_protection(struct pt_regs *regs, long error_code);
> do_spurious_interrupt_bug(struct pt_regs *regs, long error_code);
> do_coprocessor_error(struct pt_regs *regs, long error_code);
> do_alignment_check(struct pt_regs *regs, long error_code);
> do_machine_check(struct pt_regs *regs, long error_code);
> do_simd_coprocessor_error(struct pt_regs *regs, long error_code);
> do_iret_error(struct pt_regs *regs, long error_code);
> do_mce(struct pt_regs *regs, long error_code);
>
> do_async_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address);
> do_double_fault(struct pt_regs *regs, long error_code, unsigned long address);
> do_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address);
>
> So if we can remove the third argument then we can spare most of the macro
> maze and just have one common function without bells and whistels. The
> other option would be to extend all handlers to have three arguments,
> i.e. add 'long unused', which is not pretty either.
>
> What's your plan with cr2? Stash it in pt_regs or something else?

Just read it from CR2.  I added a new idtentry macro arg called
"entry_work", and setting it to 0 causes the enter_from_user_mode to
be skipped.  Then C code calls enter_from_user_mode() after reading
CR2 (and DR7).  WIP code is here:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/idtentry

The idea is that, if everything is converted, then we get rid of the
entry_work=1 case, which is easier if there's a macro.

So my suggestion is to use a macro for the 2-arg version and open-code
all the 3-arg cases.  Then, when the dust settles, we get rid of the
third arg and they can use the macro.

>
> The interesting bells and whistels result from sys_iopl(). If user space
> has been granted iopl(level = 3) it gains cli/sti priviledges. When the
> application has interrupts disabled in userspace:
>
>   - invocation of a syscall
>
>   - any exception (aside of NMI/MCE) which conditionally enables interrupts
>     depending on user_mode(regs) and therefor can be preempted and
>     schedule
>
> is just undefined behaviour and I personally consider it to be a plain bug.
>
> Just for the record: This results in running a resulting or even completely
> unrelated signal handler with interrupts disabled as well.

I am seriously tempted to say that the solution is to remove iopl(),
at least on 64-bit kernels.  Doing STI in user mode is BS :)

Otherwise we need to give it semantics, no?  I personally have no
actual problem with the fact that an NMI can cause scheduling to
happen.  Big fscking deal.

>
> Whatever we decide it is, leaving it completely inconsistent is not a
> solution at all. The options are:
>
>   1)  Always do conditional tracing depending on the user_regs->eflags.IF
>       state.

I'm okay with always tracing like user mode means IRQs on or doing it
"correctly".  I consider the former to be simpler and therefore quite
possibly better.

>
>   2)  #1 + warn once when syscalls and exceptions (except NMI/MCE) happen
>       and user_regs->eflags.IF is cleared.
>
>   3a) #2 + enforce signal handling to run with interrupts enabled.
>
>   3b) #2 + set regs->eflags.IF. So the state is always correct from the
>       kernel POV. Of course that changes existing behaviour, but its
>       changing undefined and inconsistent behaviour.
>
>   4) Let iopl(level) return -EPERM if level == 3.
>
>      Yeah, I know it's not possible due to regressions (DPKD uses iopl(3)),
>      but TBH that'd be the sanest option of all.
>
>      Of course the infinite wisdom of hardware designers tied IN, INS, OUT,
>      OUTS and CLI/STI together on IOPL so we cannot even distangle them in
>      any way.

>
>      The only way out would be to actually use a full 8K sized I/O bitmap,
>      but that's a massive pain as it has to be copied on every context
>      switch.

Hmm.  This actually doesn't seem that bad.  We already have a TIF_
flag to optimize this.  So basically iopl() would effectively become
ioperm(everything on).

Thomas Gleixner Oct. 24, 2019, 10:33 p.m. UTC | #9

On Thu, 24 Oct 2019, Andy Lutomirski wrote:
> On Thu, Oct 24, 2019 at 1:53 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > I spent quite some time digging deeper into this. Finding all corner cases
> > which eventually enable interrupts from an exception handler is not as
> > trivial as it looked in the first place. Especially the fault handler is a
> > nightmare. Also PeterZ's approach of doing
> >
> >            if (regs->eflags & IF)
> >                 local_irq_disable();
> >
> > is doomed due to sys_iopl(). See below.
> 
> I missed something in the discussion.  What breaks?

Assume user space has issued CLI then the above check is giving the wrong
answer because it assumes that all faults in user mode have IF set.

> Can you check user_mode(regs) too?

Yes, but I still hate it with a passion :)

> > What's your plan with cr2? Stash it in pt_regs or something else?
> 
> Just read it from CR2.  I added a new idtentry macro arg called
> "entry_work", and setting it to 0 causes the enter_from_user_mode to
> be skipped.  Then C code calls enter_from_user_mode() after reading
> CR2 (and DR7).  WIP code is here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/idtentry
> 
> The idea is that, if everything is converted, then we get rid of the
> entry_work=1 case, which is easier if there's a macro.
> 
> So my suggestion is to use a macro for the 2-arg version and open-code
> all the 3-arg cases.  Then, when the dust settles, we get rid of the
> third arg and they can use the macro.

I'll have a look tomorrow with brain awake.
 
> > The interesting bells and whistels result from sys_iopl(). If user space
> > has been granted iopl(level = 3) it gains cli/sti priviledges. When the
> > application has interrupts disabled in userspace:
> >
> >   - invocation of a syscall
> >
> >   - any exception (aside of NMI/MCE) which conditionally enables interrupts
> >     depending on user_mode(regs) and therefor can be preempted and
> >     schedule
> >
> > is just undefined behaviour and I personally consider it to be a plain bug.
> >
> > Just for the record: This results in running a resulting or even completely
> > unrelated signal handler with interrupts disabled as well.
> 
> I am seriously tempted to say that the solution is to remove iopl(),
> at least on 64-bit kernels.  Doing STI in user mode is BS :)

STI would be halfways sane. CLI is the problem. And yes I agree it's BS :)

> Otherwise we need to give it semantics, no?  I personally have no
> actual problem with the fact that an NMI can cause scheduling to
> happen.  Big fscking deal.

Right, I don't care either. I do neither care that any exception/syscall
which hits a user space CLI region might schedule. It's been that way
forever.

But giving this semantics is insanely hard at least if you want sensible,
useful, consistent and understandable semantics. I know that's overrated.

> > Whatever we decide it is, leaving it completely inconsistent is not a
> > solution at all. The options are:
> >
> >   1)  Always do conditional tracing depending on the user_regs->eflags.IF
> >       state.
> 
> I'm okay with always tracing like user mode means IRQs on or doing it
> "correctly".  I consider the former to be simpler and therefore quite
> possibly better.
> 
> >
> >   2)  #1 + warn once when syscalls and exceptions (except NMI/MCE) happen
> >       and user_regs->eflags.IF is cleared.
> >
> >   3a) #2 + enforce signal handling to run with interrupts enabled.
> >
> >   3b) #2 + set regs->eflags.IF. So the state is always correct from the
> >       kernel POV. Of course that changes existing behaviour, but its
> >       changing undefined and inconsistent behaviour.
> >
> >   4) Let iopl(level) return -EPERM if level == 3.
> >
> >      Yeah, I know it's not possible due to regressions (DPKD uses iopl(3)),
> >      but TBH that'd be the sanest option of all.
> >
> >      Of course the infinite wisdom of hardware designers tied IN, INS, OUT,
> >      OUTS and CLI/STI together on IOPL so we cannot even distangle them in
> >      any way.
> 
> >
> >      The only way out would be to actually use a full 8K sized I/O bitmap,
> >      but that's a massive pain as it has to be copied on every context
> >      switch.
> 
> Hmm.  This actually doesn't seem that bad.  We already have a TIF_
> flag to optimize this.  So basically iopl() would effectively become
> ioperm(everything on).

Yes, and the insane user space would:

     1) Pay the latency price for copying 8K bitmap on every context switch
     	IN

     2) Inflict latency on the next task due to requiring memset of 8K
     	bitmap on every context switch OUT

     3) #GP when issuing CLI/STI

I personally have no problem with that. #1 and #3 are sane and as iopl()
requires CAP_RAW_IO it's not available to Joe User, so the sysadmin is
responsible for eventual issues resulting from #2.

Though the no-regression hammer might pound on #3 as it breaks random
engineering trainwrecks from hell.

#1/#2 could be easily mitigated though.

      struct tss_struct {
      	struct x86_hw_tss       x86_tss;
	unsigned long           io_bitmap[IO_BITMAP_LONGS + 1];
      };

and x86_tss has

    u16	io_bitmap_base;

which is either set to

  INVALID_IO_BITMAP_OFFSET ( 0x8000 )

or

  IO_BITMAP_OFFSET
    (offsetof(struct tss_struct, io_bitmap) - offsetof(struct tss_struct, x86_tss))

So we could add

	unsigned long           io_bitmap_all[IO_BITMAP_LONGS + 1];

and just set the base to this one.

But that involves also upping __KERNEL_TSS_LIMIT. Too tired to think about
the implications of that right now.

Thanks,

	tglx

Alexandre Chartre Nov. 8, 2019, 11:07 a.m. UTC | #10

On 10/23/19 2:27 PM, Thomas Gleixner wrote:
> Now that the trap handlers return with interrupts disabled, the
> unconditional disabling of interrupts in the low level entry code can be
> removed along with the trace calls.
> 
> Add debug checks where appropriate.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   arch/x86/entry/entry_64.S |    9 +++------
>   1 file changed, 3 insertions(+), 6 deletions(-)
> 

Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>

alex.

[V2,07/17] x86/entry/64: Remove redundant interrupt disable

Commit Message

Comments

Patch