[RFC,05/13] x86/irq: Reserve a user IPI notification vector

Message ID	20210913200132.3396598-6-sohil.mehta@intel.com (mailing list archive)
State	New
Headers	show Return-Path: <linux-kselftest-owner@kernel.org> From: Sohil Mehta <sohil.mehta@intel.com> To: x86@kernel.org Cc: Sohil Mehta <sohil.mehta@intel.com>, Tony Luck <tony.luck@intel.com>, Dave Hansen <dave.hansen@intel.com>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, "H . Peter Anvin" <hpa@zytor.com>, Andy Lutomirski <luto@kernel.org>, Jens Axboe <axboe@kernel.dk>, Christian Brauner <christian@brauner.io>, Peter Zijlstra <peterz@infradead.org>, Shuah Khan <shuah@kernel.org>, Arnd Bergmann <arnd@arndb.de>, Jonathan Corbet <corbet@lwn.net>, Ashok Raj <ashok.raj@intel.com>, Jacob Pan <jacob.jun.pan@linux.intel.com>, Gayatri Kammela <gayatri.kammela@intel.com>, Zeng Guang <guang.zeng@intel.com>, Dan Williams <dan.j.williams@intel.com>, Randy E Witt <randy.e.witt@intel.com>, Ravi V Shankar <ravi.v.shankar@intel.com>, Ramesh Thomas <ramesh.thomas@intel.com>, linux-api@vger.kernel.org, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org Subject: [RFC PATCH 05/13] x86/irq: Reserve a user IPI notification vector Date: Mon, 13 Sep 2021 13:01:24 -0700 Message-Id: <20210913200132.3396598-6-sohil.mehta@intel.com> In-Reply-To: <20210913200132.3396598-1-sohil.mehta@intel.com> References: <20210913200132.3396598-1-sohil.mehta@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	x86 User Interrupts support \| expand [RFC,00/13] x86 User Interrupts support [RFC,01/13] x86/uintr/man-page: Include man pages draft for reference [RFC,02/13] Documentation/x86: Add documentation for User Interrupts [RFC,03/13] x86/cpu: Enumerate User Interrupts support [RFC,04/13] x86/fpu/xstate: Enumerate User Interrupts supervisor state [RFC,05/13] x86/irq: Reserve a user IPI notification vector [RFC,06/13] x86/uintr: Introduce uintr receiver syscalls [RFC,07/13] x86/process/64: Add uintr task context switch support [RFC,08/13] x86/process/64: Clean up uintr task fork and exit paths [RFC,09/13] x86/uintr: Introduce vector registration and uintr_fd syscall [RFC,10/13] x86/uintr: Introduce user IPI sender syscalls [RFC,11/13] x86/uintr: Introduce uintr_wait() syscall [RFC,12/13] x86/uintr: Wire up the user interrupt syscalls [RFC,13/13] selftests/x86: Add basic tests for User IPI

Sohil Mehta Sept. 13, 2021, 8:01 p.m. UTC

A user interrupt notification vector is used on the receiver's cpu to
identify an interrupt as a user interrupt (and not a kernel interrupt).
Hardware uses the same notification vector to generate an IPI from a
sender's cpu core when the SENDUIPI instruction is executed.

Typically, the kernel shouldn't receive an interrupt with this vector.
However, it is possible that the kernel might receive this vector.

Scenario that can cause the spurious interrupt:

Step	cpu 0 (receiver task)		cpu 1 (sender task)
----	---------------------		-------------------
1	task is running
2					executes SENDUIPI
3					IPI sent
4	context switched out
5	IPI delivered
	(kernel interrupt detected)

A kernel interrupt can be detected, if a receiver task gets scheduled
out after the SENDUIPI-based IPI was sent but before the IPI was
delivered.

The kernel doesn't need to do anything in this case other than receiving
the interrupt and clearing the local APIC. The user interrupt is always
stored in the receiver's UPID before the IPI is generated. When the
receiver gets scheduled back the interrupt would be delivered based on
its UPID.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
---
 arch/x86/include/asm/hardirq.h     |  3 +++
 arch/x86/include/asm/idtentry.h    |  4 ++++
 arch/x86/include/asm/irq_vectors.h |  5 ++++-
 arch/x86/kernel/idt.c              |  3 +++
 arch/x86/kernel/irq.c              | 33 ++++++++++++++++++++++++++++++
 5 files changed, 47 insertions(+), 1 deletion(-)

Thomas Gleixner Sept. 23, 2021, 11:07 p.m. UTC | #1

On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
> A user interrupt notification vector is used on the receiver's cpu to
> identify an interrupt as a user interrupt (and not a kernel interrupt).
> Hardware uses the same notification vector to generate an IPI from a
> sender's cpu core when the SENDUIPI instruction is executed.
>
> Typically, the kernel shouldn't receive an interrupt with this vector.
> However, it is possible that the kernel might receive this vector.
>
> Scenario that can cause the spurious interrupt:
>
> Step	cpu 0 (receiver task)		cpu 1 (sender task)
> ----	---------------------		-------------------
> 1	task is running
> 2					executes SENDUIPI
> 3					IPI sent
> 4	context switched out
> 5	IPI delivered
> 	(kernel interrupt detected)
>
> A kernel interrupt can be detected, if a receiver task gets scheduled
> out after the SENDUIPI-based IPI was sent but before the IPI was
> delivered.

What happens if the SENDUIPI is issued when the target task is not on
the CPU? How is that any different from the above?

> The kernel doesn't need to do anything in this case other than receiving
> the interrupt and clearing the local APIC. The user interrupt is always
> stored in the receiver's UPID before the IPI is generated. When the
> receiver gets scheduled back the interrupt would be delivered based on
> its UPID.

So why on earth is that vector reaching the CPU at all?

> +#ifdef CONFIG_X86_USER_INTERRUPTS
> +	seq_printf(p, "%*s: ", prec, "UIS");

No point in printing that when user interrupts are not available/enabled
on the system.

> +	for_each_online_cpu(j)
> +		seq_printf(p, "%10u ", irq_stats(j)->uintr_spurious_count);
> +	seq_puts(p, "  User-interrupt spurious event\n");
>  #endif
>  	return 0;
>  }
> @@ -325,6 +331,33 @@ DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_nested_ipi)
>  }
>  #endif
>  
> +#ifdef CONFIG_X86_USER_INTERRUPTS
> +/*
> + * Handler for UINTR_NOTIFICATION_VECTOR.
> + *
> + * The notification vector is used by the cpu to detect a User Interrupt. In
> + * the typical usage, the cpu would handle this interrupt and clear the local
> + * apic.
> + *
> + * However, it is possible that the kernel might receive this vector. This can
> + * happen if the receiver thread was running when the interrupt was sent but it
> + * got scheduled out before the interrupt was delivered. The kernel doesn't
> + * need to do anything other than clearing the local APIC. A pending user
> + * interrupt is always saved in the receiver's UPID which can be referenced
> + * when the receiver gets scheduled back.
> + *
> + * If the kernel receives a storm of these, it could mean an issue with the
> + * kernel's saving and restoring of the User Interrupt MSR state; Specifically,
> + * the notification vector bits in the IA32_UINTR_MISC_MSR.

Definitely well thought out hardware that.

Thanks,

        tglx

Thomas Gleixner Sept. 25, 2021, 1:30 p.m. UTC | #2

On Fri, Sep 24 2021 at 01:07, Thomas Gleixner wrote:
> On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
>> The kernel doesn't need to do anything in this case other than receiving
>> the interrupt and clearing the local APIC. The user interrupt is always
>> stored in the receiver's UPID before the IPI is generated. When the
>> receiver gets scheduled back the interrupt would be delivered based on
>> its UPID.
>
> So why on earth is that vector reaching the CPU at all?

Let's see how this works:

  task starts using UINTR.
    set UINTR_NOTIFACTION_VECTOR in MSR_IA32_UINTR_MISC

So from that point on the User-Interrupt Notification Identification
mechanism swallows the vector.

Where this stops working is not limited to context switch. The wreckage
comes from XSAVES:

 "After saving the user-interrupt state component, XSAVES clears
  UINV. (UINV is IA32_UINTR_MISC[39:32]; XSAVES does not modify the
  remainder of that MSR.)"

So the problem is _not_ context switch. The problem is XSAVES and that
can be issued even without a context switch.

The obvious question is: What is the value of clearing UINV?

Absolutely none. That notification vector cannot be used for anything
else, so why would the OS be interested to see it ever? This is about
user space interupts, right?

UINV should be set _ONCE_ when CR4.UINTR is enabled and not be touched
by XSAVES/XRSTORS at all. Any delivery of this vector to the OS should
be considered a hardware bug.

Thanks,

         tglx

Thomas Gleixner Sept. 26, 2021, 12:39 p.m. UTC | #3

On Sat, Sep 25 2021 at 15:30, Thomas Gleixner wrote:
> On Fri, Sep 24 2021 at 01:07, Thomas Gleixner wrote:
> The obvious question is: What is the value of clearing UINV?
>
> Absolutely none. That notification vector cannot be used for anything
> else, so why would the OS be interested to see it ever? This is about
> user space interupts, right?
>
> UINV should be set _ONCE_ when CR4.UINTR is enabled and not be touched
> by XSAVES/XRSTORS at all. Any delivery of this vector to the OS should
> be considered a hardware bug.

After decoding the documentation (sigh) and staring at the implications of
keeping UINV armed, I can see the point vs. the UPID lifetime issue when
a task gets scheduled out and migrated to a different CPU.

Not the most pretty solution, but as there needs to be some invalidation
which needs to be undone on return to user space it probably does not
matter much. 

As the whole thing is tightly coupled to XSAVES/RSTORS we need to
integrate it into that machinery and not pretend that it's something
half independent.

That means we have to handle the setting of the SN bit in UPID whenever
XSTATE is saved either during context switch, when the kernel uses the
FPU or in other places (signals, fpu_clone ...). They all end up in
save_fpregs_to_fpstate() so that might be the place to look at.

While talking about that: fpu_clone() has to invalidate the UINTR state
in the clone's xstate after the memcpy() or xsaves() operation.

Also the restore portion on the way back to user space has to be coupled
more tightly:

arch_exit_to_user_mode_prepare()
{
        ...
        if (unlikely(ti_work & _TIF_UPID))
        	uintr_restore_upid(ti_work & _TIF_NEED_FPU_LOAD);
        if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
        	switch_fpu_return();
}

upid_set_ndst(upid)
{
	apicid = __this_cpu_read(x86_cpu_to_apicid);

        if (x2apic_enabled())
            upid->ndst.x2apic = apicid;
        else
            upid->ndst.apic = apicid;
}

uintr_restore_upid(bool xrstors_pending)
{
        clear_thread_flag(TIF_UPID);
        
	// Update destination
        upid_set_ndst(upid);

        // Do we need something stronger here?
        barrier();

        clear_bit(SN, upid->status);

        // Any SENDUIPI after this point sends to this CPU
           
        // Any bit which was set in upid->pir after SN was set
        // and/or UINV was cleared by XSAVES up to the point
        // where SN was cleared above is not reflected in UIRR.

	// As this runs with interrupts disabled the current state
        // of upid->pir can be read and used for restore. A SENDUIPI
        // which sets a bit in upid->pir after that read will send
        // the notification vector which is going to be handled once
        // the task reenables interrupts on return to user space.
        // If the SENDUIPI set the bit before the read then the
        // notification vector handling will just observe the same
        // PIR state.

        // Needs to be a locked access as there might be a
        // concurrent SENDUIPI modiying it.
        pir = read_locked(upid->pir);

        if (xrstors_pending)) {
        	// Update the saved xstate for xrstors
           	current->xstate.uintr.uinv = UINTR_NOTIFICATION_VECTOR;
                current->xstate.uintr.uirr = pir;
        } else {
                // Manually restore UIRR and UINV
                wrmsrl(IA32_UINTR_RR, pir);

	        misc.val64 = 0;
                misc.uittsz = current->uintr->uittsz;
                misc.uinv = UINTR_NOTIFICATION_VECTOR;
                wrmsrl(IA32_UINTR_MISC, misc.val64);
        }
}

That's how I deciphered the documentation and I don't think this is far
from reality, but I might be wrong as usual.

Hmm?

Thanks,

        tglx

Sohil Mehta Sept. 27, 2021, 7:07 p.m. UTC | #4

On 9/26/2021 5:39 AM, Thomas Gleixner wrote:
> On Sat, Sep 25 2021 at 15:30, Thomas Gleixner wrote:
>> On Fri, Sep 24 2021 at 01:07, Thomas Gleixner wrote:
>> The obvious question is: What is the value of clearing UINV?
>>
>> Absolutely none. That notification vector cannot be used for anything
>> else, so why would the OS be interested to see it ever? This is about
>> user space interupts, right?
>>
>> UINV should be set _ONCE_ when CR4.UINTR is enabled and not be touched
>> by XSAVES/XRSTORS at all. Any delivery of this vector to the OS should
>> be considered a hardware bug.
> After decoding the documentation (sigh) and staring at the implications of
> keeping UINV armed, I can see the point vs. the UPID lifetime issue when
> a task gets scheduled out and migrated to a different CPU.


I think you got it right. Here is my understanding of this.

The User-interrupt notification processing moves all the pending 
interrupts from UPID.PIR to the UIRR.

As you mentioned below, XSTATE is saved due to several reasons which 
saves the UIRR into memory. UIRR should no longer be updated after it 
has been saved.

XSAVES clears UINV is to stop detecting additional interrupts in the 
UIRR after it has been saved.


> Not the most pretty solution, but as there needs to be some invalidation
> which needs to be undone on return to user space it probably does not
> matter much.
>
> As the whole thing is tightly coupled to XSAVES/RSTORS we need to
> integrate it into that machinery and not pretend that it's something
> half independent.


I agree. Thank you for pointing this out.

> That means we have to handle the setting of the SN bit in UPID whenever
> XSTATE is saved either during context switch, when the kernel uses the
> FPU or in other places (signals, fpu_clone ...). They all end up in
> save_fpregs_to_fpstate() so that might be the place to look at.

  Yes. The current code doesn't do this. SN bit should be set whenever 
UINTR XSTATE is saved.

> While talking about that: fpu_clone() has to invalidate the UINTR state
> in the clone's xstate after the memcpy() or xsaves() operation.
>
> Also the restore portion on the way back to user space has to be coupled
> more tightly:
>
> arch_exit_to_user_mode_prepare()
> {
>          ...
>          if (unlikely(ti_work & _TIF_UPID))
>          	uintr_restore_upid(ti_work & _TIF_NEED_FPU_LOAD);
>          if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
>          	switch_fpu_return();
> }

I am assuming _TIF_UPID would be set everytime SN is set and XSTATE is 
saved.

> upid_set_ndst(upid)
> {
> 	apicid = __this_cpu_read(x86_cpu_to_apicid);
>
>          if (x2apic_enabled())
>              upid->ndst.x2apic = apicid;
>          else
>              upid->ndst.apic = apicid;
> }
>
> uintr_restore_upid(bool xrstors_pending)
> {
>          clear_thread_flag(TIF_UPID);
>          
> 	// Update destination
>          upid_set_ndst(upid);
>
>          // Do we need something stronger here?
>          barrier();
>
>          clear_bit(SN, upid->status);
>
>          // Any SENDUIPI after this point sends to this CPU
>             
>          // Any bit which was set in upid->pir after SN was set
>          // and/or UINV was cleared by XSAVES up to the point
>          // where SN was cleared above is not reflected in UIRR.
>
> 	// As this runs with interrupts disabled the current state
>          // of upid->pir can be read and used for restore. A SENDUIPI
>          // which sets a bit in upid->pir after that read will send
>          // the notification vector which is going to be handled once
>          // the task reenables interrupts on return to user space.
>          // If the SENDUIPI set the bit before the read then the
>          // notification vector handling will just observe the same
>          // PIR state.
>
>          // Needs to be a locked access as there might be a
>          // concurrent SENDUIPI modiying it.
>          pir = read_locked(upid->pir);
>
>          if (xrstors_pending)) {
>          	// Update the saved xstate for xrstors
>             	current->xstate.uintr.uinv = UINTR_NOTIFICATION_VECTOR;

XSAVES saves the UINV value into the XSTATE buffer. I am not sure if we 
need this again. Is it because it could have been overwritten by calling 
XSAVES twice?


>                  current->xstate.uintr.uirr = pir;

I believe PIR should be ORed. There could be some bits already set in 
the UIRR.

Also, shouldn't UPID->PIR be cleared? If not, we would detect these 
interrupts all over again during the next ring transition.

>          } else {
>                  // Manually restore UIRR and UINV
>                  wrmsrl(IA32_UINTR_RR, pir);
I believe read-modify-write here as well.
> 	        misc.val64 = 0;
>                  misc.uittsz = current->uintr->uittsz;
>                  misc.uinv = UINTR_NOTIFICATION_VECTOR;
>                  wrmsrl(IA32_UINTR_MISC, misc.val64);

Thanks! This helps reduce the additional MSR read.

>          }
> }
>
> That's how I deciphered the documentation and I don't think this is far
> from reality, but I might be wrong as usual.
>
> Hmm?

Thank you for the simplification. This is very helpful.

Sohil

Sohil Mehta Sept. 27, 2021, 7:26 p.m. UTC | #5

On 9/23/2021 4:07 PM, Thomas Gleixner wrote:
> On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
>> A user interrupt notification vector is used on the receiver's cpu to
>> identify an interrupt as a user interrupt (and not a kernel interrupt).
>> Hardware uses the same notification vector to generate an IPI from a
>> sender's cpu core when the SENDUIPI instruction is executed.
>>
>> Typically, the kernel shouldn't receive an interrupt with this vector.
>> However, it is possible that the kernel might receive this vector.
>>
>> Scenario that can cause the spurious interrupt:
>>
>> Step	cpu 0 (receiver task)		cpu 1 (sender task)
>> ----	---------------------		-------------------
>> 1	task is running
>> 2					executes SENDUIPI
>> 3					IPI sent
>> 4	context switched out
>> 5	IPI delivered
>> 	(kernel interrupt detected)
>>
>> A kernel interrupt can be detected, if a receiver task gets scheduled
>> out after the SENDUIPI-based IPI was sent but before the IPI was
>> delivered.
> What happens if the SENDUIPI is issued when the target task is not on
> the CPU? How is that any different from the above?


This didn't get covered in the other thread. Thought, I would clarify 
this a bit more.

A notification IPI is sent from the CPU that executes SENDUIPI if the 
target task is running (SN is 0).

If the target task is not running SN bit in the UPID is set, which 
prevents any notification interrupts from being generated.

However, it is possible that SN is 0 when SENDUIPI was executed which 
generates the notification IPI. But when the IPI arrives on receiver 
CPU, SN has been set, the task state has been saved and UINV has been 
cleared.

A kernel interrupt is detected in this case. I have a sample that demos 
this. I'll fix the current code and then send out the results.


>> The kernel doesn't need to do anything in this case other than receiving
>> the interrupt and clearing the local APIC. The user interrupt is always
>> stored in the receiver's UPID before the IPI is generated. When the
>> receiver gets scheduled back the interrupt would be delivered based on
>> its UPID.
> So why on earth is that vector reaching the CPU at all?

You covered this in the other thread.

>> +#ifdef CONFIG_X86_USER_INTERRUPTS
>> +	seq_printf(p, "%*s: ", prec, "UIS");
> No point in printing that when user interrupts are not available/enabled
> on the system.
>
Will fix this.

Thanks,

Sohil

Thomas Gleixner Sept. 28, 2021, 8:11 a.m. UTC | #6

Sohil,

On Mon, Sep 27 2021 at 12:07, Sohil Mehta wrote:
> On 9/26/2021 5:39 AM, Thomas Gleixner wrote:
>
> The User-interrupt notification processing moves all the pending 
> interrupts from UPID.PIR to the UIRR.

Indeed that makes sense. Should have thought about that myself.

>> Also the restore portion on the way back to user space has to be coupled
>> more tightly:
>>
>> arch_exit_to_user_mode_prepare()
>> {
>>          ...
>>          if (unlikely(ti_work & _TIF_UPID))
>>          	uintr_restore_upid(ti_work & _TIF_NEED_FPU_LOAD);
>>          if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
>>          	switch_fpu_return();
>> }
>
> I am assuming _TIF_UPID would be set everytime SN is set and XSTATE is 
> saved.

Yes.

>> upid_set_ndst(upid)
>> {
>> 	apicid = __this_cpu_read(x86_cpu_to_apicid);
>>
>>          if (x2apic_enabled())
>>              upid->ndst.x2apic = apicid;
>>          else
>>              upid->ndst.apic = apicid;
>> }
>>
>> uintr_restore_upid(bool xrstors_pending)
>> {
>>          clear_thread_flag(TIF_UPID);
>>          
>> 	// Update destination
>>          upid_set_ndst(upid);
>>
>>          // Do we need something stronger here?
>>          barrier();
>>
>>          clear_bit(SN, upid->status);
>>
>>          // Any SENDUIPI after this point sends to this CPU
>>             
>>          // Any bit which was set in upid->pir after SN was set
>>          // and/or UINV was cleared by XSAVES up to the point
>>          // where SN was cleared above is not reflected in UIRR.
>>
>> 	// As this runs with interrupts disabled the current state
>>          // of upid->pir can be read and used for restore. A SENDUIPI
>>          // which sets a bit in upid->pir after that read will send
>>          // the notification vector which is going to be handled once
>>          // the task reenables interrupts on return to user space.
>>          // If the SENDUIPI set the bit before the read then the
>>          // notification vector handling will just observe the same
>>          // PIR state.
>>
>>          // Needs to be a locked access as there might be a
>>          // concurrent SENDUIPI modiying it.
>>          pir = read_locked(upid->pir);
>>
>>          if (xrstors_pending)) {
>>          	// Update the saved xstate for xrstors
>>             	current->xstate.uintr.uinv = UINTR_NOTIFICATION_VECTOR;
>
> XSAVES saves the UINV value into the XSTATE buffer. I am not sure if we 
> need this again. Is it because it could have been overwritten by calling 
> XSAVES twice?

Yes that can happen AFAICT. I haven't done a deep analysis, but this
needs to looked at.

>>                  current->xstate.uintr.uirr = pir;
>
> I believe PIR should be ORed. There could be some bits already set in 
> the UIRR.
>
> Also, shouldn't UPID->PIR be cleared? If not, we would detect these 
> interrupts all over again during the next ring transition.

Right. So that PIR read above needs to be a locked cmpxchg().

>>          } else {
>>                  // Manually restore UIRR and UINV
>>                  wrmsrl(IA32_UINTR_RR, pir);
> I believe read-modify-write here as well.

Sigh, yes.

Thanks,

        tglx

[RFC,05/13] x86/irq: Reserve a user IPI notification vector

Commit Message

Comments

Patch