diff mbox

[v3] enable x2APIC without interrupt remapping under KVM

Message ID 20090629132926.GB20289@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Gleb Natapov June 29, 2009, 1:29 p.m. UTC
KVM would like to provide x2APIC interface to a guest without emulating
interrupt remapping device. The reason KVM prefers guest to use x2APIC
is that x2APIC interface is better virtualizable and provides better
performance than mmio xAPIC interface:

- msr exits are faster than mmio (no page table walk, emulation)
- no need to read back ICR to look at the busy bit
- one 64 bit ICR write instead of two 32 bit writes
- shared code with the Hyper-V paravirt interface
    
Included patch changes x2APIC enabling logic to enable it even if IR
initialization failed, but kernel runs under KVM and no apic id is
greater than 255 (if there is one spec requires BIOS to move to x2apic
mode before starting an OS).

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---

v2:
  Add help message to Kconfig
  Force physical mode if IR is not available.

v3:
  disable io-apic in outer function. Enable x2apic mode with masked
  io-apic

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Suresh Siddha June 29, 2009, 2:58 p.m. UTC | #1
On Mon, 2009-06-29 at 06:29 -0700, Gleb Natapov wrote:
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index d1430ef..3e5b6ea 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -260,12 +260,15 @@ config SMP
>  
>  config X86_X2APIC
>  	bool "Support x2apic"
> -	depends on X86_LOCAL_APIC && X86_64 && INTR_REMAP
> +	depends on X86_LOCAL_APIC && X86_64

Thinking more, probably we shouldn't remove this dependency. This might
encourage people (knowingly or unknowingly) to enable x2apic without
interrupt-remapping. Can we remove this? KVM mode will still work even
if we fail to enable interrupt-remapping. So this shouldn't be an issue,
correct?

Sorry I should have commented about this before.

>  
>  	ioapic_entries = alloc_ioapic_entries();
>  	if (!ioapic_entries) {
> -		pr_info("Allocate ioapic_entries failed: %d\n", ret);
> -		goto end;
> +		 pr_info("Allocate ioapic_entries failed\n");
> +		 return;


We should go to ir_failed ..

>  	}
>  
>  	ret = save_IO_APIC_setup(ioapic_entries);
>  	if (ret) {
>  		pr_info("Saving IO-APIC state failed: %d\n", ret);
> -		goto end;
> +		free_ioapic_entries(ioapic_entries);
> +		return;

same here.

In few hours I will be on 24 hour flight. So please bear with me for
delayed responses.

thanks,
suresh

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gleb Natapov June 29, 2009, 3:10 p.m. UTC | #2
On Mon, Jun 29, 2009 at 07:58:39AM -0700, Suresh Siddha wrote:
> On Mon, 2009-06-29 at 06:29 -0700, Gleb Natapov wrote:
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index d1430ef..3e5b6ea 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -260,12 +260,15 @@ config SMP
> >  
> >  config X86_X2APIC
> >  	bool "Support x2apic"
> > -	depends on X86_LOCAL_APIC && X86_64 && INTR_REMAP
> > +	depends on X86_LOCAL_APIC && X86_64
> 
> Thinking more, probably we shouldn't remove this dependency. This might
> encourage people (knowingly or unknowingly) to enable x2apic without
> interrupt-remapping. Can we remove this? KVM mode will still work even
> if we fail to enable interrupt-remapping. So this shouldn't be an issue,
> correct?
> 
Yes, KVM will still work. I don't have strong fillings one way or the
other, but why mandate an option that is no longer mandatory. What
others think?

> Sorry I should have commented about this before.
> 
> >  
> >  	ioapic_entries = alloc_ioapic_entries();
> >  	if (!ioapic_entries) {
> > -		pr_info("Allocate ioapic_entries failed: %d\n", ret);
> > -		goto end;
> > +		 pr_info("Allocate ioapic_entries failed\n");
> > +		 return;
> 
> 
> We should go to ir_failed ..
Why? There is not more ir_failed.

> 
> >  	}
> >  
> >  	ret = save_IO_APIC_setup(ioapic_entries);
> >  	if (ret) {
> >  		pr_info("Saving IO-APIC state failed: %d\n", ret);
> > -		goto end;
> > +		free_ioapic_entries(ioapic_entries);
> > +		return;
> 
> same here.
> 
Why? What ir_failed should do? Call free_ioapic_entries(ioapic_entries)
and exit?

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Suresh Siddha June 29, 2009, 3:15 p.m. UTC | #3
On Mon, 2009-06-29 at 08:10 -0700, Gleb Natapov wrote:
> On Mon, Jun 29, 2009 at 07:58:39AM -0700, Suresh Siddha wrote:
> > Thinking more, probably we shouldn't remove this dependency. This might
> > encourage people (knowingly or unknowingly) to enable x2apic without
> > interrupt-remapping. Can we remove this? KVM mode will still work even
> > if we fail to enable interrupt-remapping. So this shouldn't be an issue,
> > correct?
> > 
> Yes, KVM will still work. I don't have strong fillings one way or the
> other, but why mandate an option that is no longer mandatory. What
> others think?

Only under the presence of KVM we are breaking this dependency. And
typically the same kernel runs natively and under the presence of kvm
correct. So thats why we shouldn't break this dependency.


> > >  	ioapic_entries = alloc_ioapic_entries();
> > >  	if (!ioapic_entries) {
> > > -		pr_info("Allocate ioapic_entries failed: %d\n", ret);
> > > -		goto end;
> > > +		 pr_info("Allocate ioapic_entries failed\n");
> > > +		 return;
> > 
> > 
> > We should go to ir_failed ..
> Why? There is not more ir_failed.

oops. Not literally. We should goto the point where we should report the
x2apic or ir enabling failed and check for x2apic pre-enabled etc at the
end.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gleb Natapov June 29, 2009, 3:45 p.m. UTC | #4
On Mon, Jun 29, 2009 at 08:15:05AM -0700, Suresh Siddha wrote:
> On Mon, 2009-06-29 at 08:10 -0700, Gleb Natapov wrote:
> > On Mon, Jun 29, 2009 at 07:58:39AM -0700, Suresh Siddha wrote:
> > > Thinking more, probably we shouldn't remove this dependency. This might
> > > encourage people (knowingly or unknowingly) to enable x2apic without
> > > interrupt-remapping. Can we remove this? KVM mode will still work even
> > > if we fail to enable interrupt-remapping. So this shouldn't be an issue,
> > > correct?
> > > 
> > Yes, KVM will still work. I don't have strong fillings one way or the
> > other, but why mandate an option that is no longer mandatory. What
> > others think?
> 
> Only under the presence of KVM we are breaking this dependency. And
> typically the same kernel runs natively and under the presence of kvm
> correct. So thats why we shouldn't break this dependency.
> 
OK, I'll drop Kconfig part in the next version. Unless someone will
object till tomorrow. 

> 
> > > >  	ioapic_entries = alloc_ioapic_entries();
> > > >  	if (!ioapic_entries) {
> > > > -		pr_info("Allocate ioapic_entries failed: %d\n", ret);
> > > > -		goto end;
> > > > +		 pr_info("Allocate ioapic_entries failed\n");
> > > > +		 return;
> > > 
> > > 
> > > We should go to ir_failed ..
> > Why? There is not more ir_failed.
> 
> oops. Not literally. We should goto the point where we should report the
> x2apic or ir enabling failed and check for x2apic pre-enabled etc at the
> end.
> 
Ah, OK.

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman June 30, 2009, 9:24 a.m. UTC | #5
Gleb Natapov <gleb@redhat.com> writes:

> KVM would like to provide x2APIC interface to a guest without emulating
> interrupt remapping device. The reason KVM prefers guest to use x2APIC
> is that x2APIC interface is better virtualizable and provides better
> performance than mmio xAPIC interface:
>
> - msr exits are faster than mmio (no page table walk, emulation)
> - no need to read back ICR to look at the busy bit
> - one 64 bit ICR write instead of two 32 bit writes
> - shared code with the Hyper-V paravirt interface
>     
> Included patch changes x2APIC enabling logic to enable it even if IR
> initialization failed, but kernel runs under KVM and no apic id is
> greater than 255 (if there is one spec requires BIOS to move to x2apic
> mode before starting an OS).


How common is hotplug hardware in kvm?  In particular hotplug cpus?

To support that seriously you need interrupt remapping.

Eric


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gleb Natapov June 30, 2009, 9:26 a.m. UTC | #6
On Tue, Jun 30, 2009 at 02:24:05AM -0700, Eric W. Biederman wrote:
> Gleb Natapov <gleb@redhat.com> writes:
> 
> > KVM would like to provide x2APIC interface to a guest without emulating
> > interrupt remapping device. The reason KVM prefers guest to use x2APIC
> > is that x2APIC interface is better virtualizable and provides better
> > performance than mmio xAPIC interface:
> >
> > - msr exits are faster than mmio (no page table walk, emulation)
> > - no need to read back ICR to look at the busy bit
> > - one 64 bit ICR write instead of two 32 bit writes
> > - shared code with the Hyper-V paravirt interface
> >     
> > Included patch changes x2APIC enabling logic to enable it even if IR
> > initialization failed, but kernel runs under KVM and no apic id is
> > greater than 255 (if there is one spec requires BIOS to move to x2apic
> > mode before starting an OS).
> 
> 
> How common is hotplug hardware in kvm?  In particular hotplug cpus?
> 
It works for Linux guests.

> To support that seriously you need interrupt remapping.
> 
Can you explain why?

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman June 30, 2009, 4:44 p.m. UTC | #7
Gleb Natapov <gleb@redhat.com> writes:

> On Tue, Jun 30, 2009 at 02:24:05AM -0700, Eric W. Biederman wrote:
>> Gleb Natapov <gleb@redhat.com> writes:
>> 
>> > KVM would like to provide x2APIC interface to a guest without emulating
>> > interrupt remapping device. The reason KVM prefers guest to use x2APIC
>> > is that x2APIC interface is better virtualizable and provides better
>> > performance than mmio xAPIC interface:
>> >
>> > - msr exits are faster than mmio (no page table walk, emulation)
>> > - no need to read back ICR to look at the busy bit
>> > - one 64 bit ICR write instead of two 32 bit writes
>> > - shared code with the Hyper-V paravirt interface
>> >     
>> > Included patch changes x2APIC enabling logic to enable it even if IR
>> > initialization failed, but kernel runs under KVM and no apic id is
>> > greater than 255 (if there is one spec requires BIOS to move to x2apic
>> > mode before starting an OS).
>> 
>> 
>> How common is hotplug hardware in kvm?  In particular hotplug cpus?
>> 
> It works for Linux guests.

>
>> To support that seriously you need interrupt remapping.
>> 
> Can you explain why?

Because ioapics don't fully function according to spec,
and the interrupt code on the hotplug path is a horrible
terrible broken hack for ioapics.

It is better than nothing but it certainly is not something
I would expect to work all of the time.

Interrupt remapping is the one case where we have hardware
that works according to spec and that works reasonably well.

Eric
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Avi Kivity June 30, 2009, 5:12 p.m. UTC | #8
On 06/30/2009 07:44 PM, Eric W. Biederman wrote:
>>> To support that seriously you need interrupt remapping.
>>>
>>>        
>> Can you explain why?
>>      
>
> Because ioapics don't fully function according to spec,
> and the interrupt code on the hotplug path is a horrible
> terrible broken hack for ioapics.
>
> It is better than nothing but it certainly is not something
> I would expect to work all of the time.
>    

Can you elaborate?  For kvm guests, the hardware is reasonably will 
implemented and if not we will fix it.  We need not cripple a feature 
just because some hardware is broken.
Gleb Natapov June 30, 2009, 7:03 p.m. UTC | #9
On Tue, Jun 30, 2009 at 09:44:54AM -0700, Eric W. Biederman wrote:
> Gleb Natapov <gleb@redhat.com> writes:
> 
> > On Tue, Jun 30, 2009 at 02:24:05AM -0700, Eric W. Biederman wrote:
> >> Gleb Natapov <gleb@redhat.com> writes:
> >> 
> >> > KVM would like to provide x2APIC interface to a guest without emulating
> >> > interrupt remapping device. The reason KVM prefers guest to use x2APIC
> >> > is that x2APIC interface is better virtualizable and provides better
> >> > performance than mmio xAPIC interface:
> >> >
> >> > - msr exits are faster than mmio (no page table walk, emulation)
> >> > - no need to read back ICR to look at the busy bit
> >> > - one 64 bit ICR write instead of two 32 bit writes
> >> > - shared code with the Hyper-V paravirt interface
> >> >     
> >> > Included patch changes x2APIC enabling logic to enable it even if IR
> >> > initialization failed, but kernel runs under KVM and no apic id is
> >> > greater than 255 (if there is one spec requires BIOS to move to x2apic
> >> > mode before starting an OS).
> >> 
> >> 
> >> How common is hotplug hardware in kvm?  In particular hotplug cpus?
> >> 
> > It works for Linux guests.
> 
> >
> >> To support that seriously you need interrupt remapping.
> >> 
> > Can you explain why?
> 
> Because ioapics don't fully function according to spec,
> and the interrupt code on the hotplug path is a horrible
> terrible broken hack for ioapics.
> 
Considering that interrupt remapping is fairly new feature
are you saying that hotplug (pci and cpu) on x86 is a horrible
hack on Linux?

> It is better than nothing but it certainly is not something
> I would expect to work all of the time.
> 
Because of horrible code or non-complaint ioapic implementation out
there? If later this is not a big issue for KVM.

> Interrupt remapping is the one case where we have hardware
> that works according to spec and that works reasonably well.
> 
I am sure when there was only one ioapic implementation in existence
it worked according to a spec (and if not spec was changed :)) Give
interrupt remapping some time and than we will see how above statement
holds.

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman June 30, 2009, 7:08 p.m. UTC | #10
Avi Kivity <avi@redhat.com> writes:

> On 06/30/2009 07:44 PM, Eric W. Biederman wrote:
>>>> To support that seriously you need interrupt remapping.
>>>>
>>>>        
>>> Can you explain why?
>>>      
>>
>> Because ioapics don't fully function according to spec,
>> and the interrupt code on the hotplug path is a horrible
>> terrible broken hack for ioapics.
>>
>> It is better than nothing but it certainly is not something
>> I would expect to work all of the time.
>>    
>
> Can you elaborate?  For kvm guests, the hardware is reasonably will implemented
> and if not we will fix it.  We need not cripple a feature just because some
> hardware is broken.

The short version is I don't know what work arounds we will ultimately
decide to deploy to work with real hardware.  

I have been seriously contemplating causing a cpu hot-unplug request
to fail if we are in ioapic mode and we have irqs routed to the cpu
that is being unplugged.

Even with perfectly working hardware it is not possible in the general
case to migrate an ioapic irq from one cpu to another outside of an
interrupt handler without without risking dropping an interrupt.
There is no general way to know you have seen the last interrupt
floating around your system.  PCI ordering rules don't help because
the ioapics can potentially take an out of band channel.

The interrupt remapping hardware is pci and is only present on systems
with in-band interrupts, so we can flush in-flight interrupts using
pci reads.  We can safely and do reprogram the hardware to point at
different cpus at arbitrary times.

So if you possibly can please emulate hardware that is known to work
reliably for cpu hotplug.

Eric
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Avi Kivity June 30, 2009, 7:16 p.m. UTC | #11
On 06/30/2009 10:08 PM, Eric W. Biederman wrote:
>> Can you elaborate?  For kvm guests, the hardware is reasonably will implemented
>> and if not we will fix it.  We need not cripple a feature just because some
>> hardware is broken.
>>      
>
> The short version is I don't know what work arounds we will ultimately
> decide to deploy to work with real hardware.
>
> I have been seriously contemplating causing a cpu hot-unplug request
> to fail if we are in ioapic mode and we have irqs routed to the cpu
> that is being unplugged.
>    

Well, obviously we need to disassociate any irqs from such a cpu.  Could 
be done from the kernel or only enforced by the kernel.

> Even with perfectly working hardware it is not possible in the general
> case to migrate an ioapic irq from one cpu to another outside of an
> interrupt handler without without risking dropping an interrupt.
>    

Can't you generate a spurious interrupt immediately after the 
migration?  An extra interrupt shouldn't hurt.

> There is no general way to know you have seen the last interrupt
> floating around your system.  PCI ordering rules don't help because
> the ioapics can potentially take an out of band channel.
>    

Can you describe the problem scenario? an ioapic->lapic message 
delivered to a dead cpu?
Eric W. Biederman June 30, 2009, 7:36 p.m. UTC | #12
Avi Kivity <avi@redhat.com> writes:

> On 06/30/2009 10:08 PM, Eric W. Biederman wrote:
>>> Can you elaborate?  For kvm guests, the hardware is reasonably will implemented
>>> and if not we will fix it.  We need not cripple a feature just because some
>>> hardware is broken.
>>>      
>>
>> The short version is I don't know what work arounds we will ultimately
>> decide to deploy to work with real hardware.
>>
>> I have been seriously contemplating causing a cpu hot-unplug request
>> to fail if we are in ioapic mode and we have irqs routed to the cpu
>> that is being unplugged.
>>    
>
> Well, obviously we need to disassociate any irqs from such a cpu.  Could be done
> from the kernel or only enforced by the kernel.

Using the normal irq migration path we can move irqs off of a cpu reliably
there just aren't any progress guarantees.

>> Even with perfectly working hardware it is not possible in the general
>> case to migrate an ioapic irq from one cpu to another outside of an
>> interrupt handler without without risking dropping an interrupt.
>>    
>
> Can't you generate a spurious interrupt immediately after the migration?  An
> extra interrupt shouldn't hurt.

Nope.  The ioapics can't be told to send an interrupt.

>> There is no general way to know you have seen the last interrupt
>> floating around your system.  PCI ordering rules don't help because
>> the ioapics can potentially take an out of band channel.
>>    
>
> Can you describe the problem scenario? an ioapic->lapic message delivered to a
> dead cpu?

Dropped irqs..  Driver hangs because it is waiting for an irq.  Hardware
hangs because it is waiting for the cpu to process the irq.

Potentially we get a level triggered irq that is never acked by
the cpu that won't arm until the cpu send an ack, and we can't
send an ack from another cpu.

Eric
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman June 30, 2009, 7:43 p.m. UTC | #13
Gleb Natapov <gleb@redhat.com> writes:

> Considering that interrupt remapping is fairly new feature
> are you saying that hotplug (pci and cpu) on x86 is a horrible
> hack on Linux?

Just the current cpu hotplug path.

>> It is better than nothing but it certainly is not something
>> I would expect to work all of the time.
>> 
> Because of horrible code or non-complaint ioapic implementation out
> there? If later this is not a big issue for KVM.

Both.  And even in spec ioapics can't be made to work 100% reliably.

>> Interrupt remapping is the one case where we have hardware
>> that works according to spec and that works reasonably well.
>> 
> I am sure when there was only one ioapic implementation in existence
> it worked according to a spec (and if not spec was changed :)) Give
> interrupt remapping some time and than we will see how above statement
> holds.

What we need for cpu hotplug/unplug is on the tested path now.
That is a significant difference.

Eric
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Avi Kivity July 1, 2009, 8:30 a.m. UTC | #14
On 06/30/2009 10:36 PM, Eric W. Biederman wrote:
>>> The short version is I don't know what work arounds we will ultimately
>>> decide to deploy to work with real hardware.
>>>
>>> I have been seriously contemplating causing a cpu hot-unplug request
>>> to fail if we are in ioapic mode and we have irqs routed to the cpu
>>> that is being unplugged.
>>>
>>>        
>> Well, obviously we need to disassociate any irqs from such a cpu.  Could be done
>> from the kernel or only enforced by the kernel.
>>      
>
> Using the normal irq migration path we can move irqs off of a cpu reliably
> there just aren't any progress guarantees.
>    

Program the ioapic to the new cpu.  Wait a few milliseconds.  If it 
takes more than that to get an interrupt from the ioapic to the local 
apic, the machine has much bigger problems.

>>> Even with perfectly working hardware it is not possible in the general
>>> case to migrate an ioapic irq from one cpu to another outside of an
>>> interrupt handler without without risking dropping an interrupt.
>>>
>>>        
>> Can't you generate a spurious interrupt immediately after the migration?  An
>> extra interrupt shouldn't hurt.
>>      
>
> Nope.  The ioapics can't be told to send an interrupt.
>    

You can program the local apic ICR to generate an interrupt with the 
same vector.

>>> There is no general way to know you have seen the last interrupt
>>> floating around your system.  PCI ordering rules don't help because
>>> the ioapics can potentially take an out of band channel.
>>>
>>>        
>> Can you describe the problem scenario? an ioapic->lapic message delivered to a
>> dead cpu?
>>      
>
> Dropped irqs..  Driver hangs because it is waiting for an irq.  Hardware
> hangs because it is waiting for the cpu to process the irq.
>
> Potentially we get a level triggered irq that is never acked by
> the cpu that won't arm until the cpu send an ack, and we can't
> send an ack from another cpu.
>
>    

I think a spurious interrupt generated through the local apic solves 
that problem.  For level-triggered interrupts, mask them before 
offlining the cpu.
Eric W. Biederman July 1, 2009, 1:31 p.m. UTC | #15
Avi Kivity <avi@redhat.com> writes:

> On 06/30/2009 10:36 PM, Eric W. Biederman wrote:
>>>> The short version is I don't know what work arounds we will ultimately
>>>> decide to deploy to work with real hardware.
>>>>
>>>> I have been seriously contemplating causing a cpu hot-unplug request
>>>> to fail if we are in ioapic mode and we have irqs routed to the cpu
>>>> that is being unplugged.
>>>>
>>>>        
>>> Well, obviously we need to disassociate any irqs from such a cpu.  Could be done
>>> from the kernel or only enforced by the kernel.
>>>      
>>
>> Using the normal irq migration path we can move irqs off of a cpu reliably
>> there just aren't any progress guarantees.
>>    
>
> Program the ioapic to the new cpu.  Wait a few milliseconds.  If it takes more
> than that to get an interrupt from the ioapic to the local apic, the machine has
> much bigger problems.

In general you can not reprogram an ioapic safely unless the interrupt
is blocked at the source.  Which is why you either need the originating
device disabled or you have to do it in interrupt context.

I forget all of the details.  I just know in real hardware I experimented
with it a lot, and wound up hanging the ioapic state machine of both
intel and amd ioapics.

Migrating ioapic irqs in interrupt context sucks.  It just happens to
be what works reliably.

I do think the wait an eternity in computer time a short while in human
time is a valid technique when you can do nothing better.  If flushing the
interrupt was my only problem that would solve it.

>>>> Even with perfectly working hardware it is not possible in the general
>>>> case to migrate an ioapic irq from one cpu to another outside of an
>>>> interrupt handler without without risking dropping an interrupt.
>>>>
>>>>        
>>> Can't you generate a spurious interrupt immediately after the migration?  An
>>> extra interrupt shouldn't hurt.
>>>      
>>
>> Nope.  The ioapics can't be told to send an interrupt.
>>    
>
> You can program the local apic ICR to generate an interrupt with the same
> vector.

But you can not program the apic ICR to generate a level triggered
interrupt with the same vector.  So you don't get the broadcast
behavior when you ack the apic.

>>>> There is no general way to know you have seen the last interrupt
>>>> floating around your system.  PCI ordering rules don't help because
>>>> the ioapics can potentially take an out of band channel.
>>>>
>>>>        
>>> Can you describe the problem scenario? an ioapic->lapic message delivered to a
>>> dead cpu?
>>>      
>>
>> Dropped irqs..  Driver hangs because it is waiting for an irq.  Hardware
>> hangs because it is waiting for the cpu to process the irq.
>>
>> Potentially we get a level triggered irq that is never acked by
>> the cpu that won't arm until the cpu send an ack, and we can't
>> send an ack from another cpu.
>>
>>    
>
> I think a spurious interrupt generated through the local apic solves that
> problem.  For level-triggered interrupts, mask them before offlining the cpu.

So now we have a masked unacked irq.  It doesn't help in the slightest
that the cpu migration code puts irq migration last and request that we
do it all with interrupts disabled.

You might be right that by application of extreme ingenuity and completely
in spec ioapics there is a path that makes this all work.  Currently I
don't have fully in spec ioapcis, and I don't have anyone interested enough
in cpu hotplug to be willing to rip things apart until interrupt migration
is a reasonable deal on x86.   Instead all I see are patches that mitigate
the worst of the brokenness.

At the same time with the interrupt remapping hardware because it doesn't
need the irq disabled at the source when we reprogram it I can make
everything stable much more easily.

Eric
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Suresh Siddha July 1, 2009, 10:53 p.m. UTC | #16
On Tue, 2009-06-30 at 12:36 -0700, Eric W. Biederman wrote:
> Dropped irqs..  Driver hangs because it is waiting for an irq.  Hardware
> hangs because it is waiting for the cpu to process the irq.
> 
> Potentially we get a level triggered irq that is never acked by
> the cpu that won't arm until the cpu send an ack, and we can't
> send an ack from another cpu.

Eric,

Among number of experiments you have tried in the past to fix this, have
you tried the experiment of explicitly clearing the remoteIRR by
changing the trigger mode to edge and then back to level.

Is there a problem with this?

We can send a spurious IPI (after the RTE migration) with the new vector
to another cpu and handler which services the level interrupt will check
if we saw the edge mode for a level interrupt and then the handler can
explicitly restore the level trigger and reset the remote IRR by mask
+edge and unmask+level.

We might have to work with some rough edges but do you recollect any
major issue with this approach..

thanks,
suresh

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman July 2, 2009, 12:17 a.m. UTC | #17
Suresh Siddha <suresh.b.siddha@intel.com> writes:

> On Tue, 2009-06-30 at 12:36 -0700, Eric W. Biederman wrote:
>> Dropped irqs..  Driver hangs because it is waiting for an irq.  Hardware
>> hangs because it is waiting for the cpu to process the irq.
>> 
>> Potentially we get a level triggered irq that is never acked by
>> the cpu that won't arm until the cpu send an ack, and we can't
>> send an ack from another cpu.
>
> Eric,
>
> Among number of experiments you have tried in the past to fix this, have
> you tried the experiment of explicitly clearing the remoteIRR by
> changing the trigger mode to edge and then back to level.
>
> Is there a problem with this?

The problem I had wasn't remoteIRR getting stuck, but the symptoms
were largely the same.  I did try changing the trigger mode to edge
and back and that did not unstick the ioapic in all cases.

> We can send a spurious IPI (after the RTE migration) with the new vector
> to another cpu and handler which services the level interrupt will check
> if we saw the edge mode for a level interrupt and then the handler can
> explicitly restore the level trigger and reset the remote IRR by mask
> +edge and unmask+level.
>
> We might have to work with some rough edges but do you recollect any
> major issue with this approach..

This is coming up enough recently I expect it is time to cook up
a patch that does the ioapic migration in process context plus
some user space code that stress tests things.  Just so people
can repeat my experiments and see what I am trying to avoid.

Eric
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Suresh Siddha July 7, 2009, 4:14 p.m. UTC | #18
On Wed, 2009-07-01 at 17:17 -0700, Eric W. Biederman wrote:
> Suresh Siddha <suresh.b.siddha@intel.com> writes:
> > Among number of experiments you have tried in the past to fix this, have
> > you tried the experiment of explicitly clearing the remoteIRR by
> > changing the trigger mode to edge and then back to level.
> >
> > Is there a problem with this?
> 
> The problem I had wasn't remoteIRR getting stuck, but the symptoms
> were largely the same.  I did try changing the trigger mode to edge
> and back and that did not unstick the ioapic in all cases.
> 

I did some experiments locally here yesterday and on my old ICH5 based
system, I couldn't reset the remoteIRR by changing the trigger mode to
edge and then back to level.

However what worked was an explicit eoi to the io-apic using the
respective vector.

I guess we need to try both the things based on perhaps io-apic version
etc.

But what I am nervous about is, did you try both these things aswell and
still saw stuck interrupts?

I will cleanup my code and post it, so that we can try on different
systems. If this still doesn't work on certain HW platforms, atleast our
experiments of what works and what doesn't work and on what platforms
will be documented on the web.

thanks,
suresh

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman July 10, 2009, 8:29 p.m. UTC | #19
Suresh Siddha <suresh.b.siddha@intel.com> writes:

> On Wed, 2009-07-01 at 17:17 -0700, Eric W. Biederman wrote:
>> Suresh Siddha <suresh.b.siddha@intel.com> writes:
>> > Among number of experiments you have tried in the past to fix this, have
>> > you tried the experiment of explicitly clearing the remoteIRR by
>> > changing the trigger mode to edge and then back to level.
>> >
>> > Is there a problem with this?
>> 
>> The problem I had wasn't remoteIRR getting stuck, but the symptoms
>> were largely the same.  I did try changing the trigger mode to edge
>> and back and that did not unstick the ioapic in all cases.
>> 
>
> I did some experiments locally here yesterday and on my old ICH5 based
> system, I couldn't reset the remoteIRR by changing the trigger mode to
> edge and then back to level.
>
> However what worked was an explicit eoi to the io-apic using the
> respective vector.
>
> I guess we need to try both the things based on perhaps io-apic version
> etc.

In part the deep problems I ran into was something other than RemoteIRR.
If it was something as simple as the ioapic being in a documented state
I would have kept looking.  At the very least the RemoteIRR bit was not
set in some of the cases I encountered.

> But what I am nervous about is, did you try both these things aswell and
> still saw stuck interrupts?

Yes.  The work arounds were in the code so I tried them.

> I will cleanup my code and post it, so that we can try on different
> systems. If this still doesn't work on certain HW platforms, atleast our
> experiments of what works and what doesn't work and on what platforms
> will be documented on the web.

Sounds reasonable.  My apologies for the long delayed reply.  I want
to work on this but I have higher priorities.

Eric

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d1430ef..3e5b6ea 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -260,12 +260,15 @@  config SMP
 
 config X86_X2APIC
 	bool "Support x2apic"
-	depends on X86_LOCAL_APIC && X86_64 && INTR_REMAP
+	depends on X86_LOCAL_APIC && X86_64
 	---help---
 	  This enables x2apic support on CPUs that have this feature.
 
 	  This allows 32-bit apic IDs (so it can support very large systems),
-	  and accesses the local apic via MSRs not via mmio.
+	  and accesses the local apic via MSRs not via mmio. CONFIG_INTR_REMAP
+	  is required for x2apic to take affect when running on physical
+	  machine. If you intend to run the kernel as KVM guest x2apic can
+	  be used without interrupt remapping.
 
 	  If you don't know what to do here, say N.
 
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 8c7c042..f5f02c3 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -49,6 +49,7 @@ 
 #include <asm/mtrr.h>
 #include <asm/smp.h>
 #include <asm/mce.h>
+#include <asm/kvm_para.h>
 
 unsigned int num_processors;
 
@@ -1363,52 +1364,76 @@  void enable_x2apic(void)
 }
 #endif /* CONFIG_X86_X2APIC */
 
-void __init enable_IR_x2apic(void)
+int __init enable_IR(void)
 {
 #ifdef CONFIG_INTR_REMAP
 	int ret;
-	unsigned long flags;
-	struct IO_APIC_route_entry **ioapic_entries = NULL;
 
 	ret = dmar_table_init();
 	if (ret) {
 		pr_debug("dmar_table_init() failed with %d:\n", ret);
-		goto ir_failed;
+		return 0;
 	}
 
 	if (!intr_remapping_supported()) {
 		pr_debug("intr-remapping not supported\n");
-		goto ir_failed;
+		return 0;
 	}
 
-
 	if (!x2apic_preenabled && skip_ioapic_setup) {
 		pr_info("Skipped enabling intr-remap because of skipping "
 			"io-apic setup\n");
-		return;
+		return 0;
 	}
 
+	if (enable_intr_remapping(x2apic_supported()))
+		return 0;
+
+	pr_info("Enabled Interrupt-remapping\n");
+
+	return 1;
+
+#endif
+	return 0;
+}
+
+void __init enable_IR_x2apic(void)
+{
+	unsigned long flags;
+	struct IO_APIC_route_entry **ioapic_entries = NULL;
+	int ret, x2apic_enabled = 0;
+
 	ioapic_entries = alloc_ioapic_entries();
 	if (!ioapic_entries) {
-		pr_info("Allocate ioapic_entries failed: %d\n", ret);
-		goto end;
+		 pr_info("Allocate ioapic_entries failed\n");
+		 return;
 	}
 
 	ret = save_IO_APIC_setup(ioapic_entries);
 	if (ret) {
 		pr_info("Saving IO-APIC state failed: %d\n", ret);
-		goto end;
+		free_ioapic_entries(ioapic_entries);
+		return;
 	}
 
 	local_irq_save(flags);
-	mask_IO_APIC_setup(ioapic_entries);
 	mask_8259A();
+	mask_IO_APIC_setup(ioapic_entries);
 
-	ret = enable_intr_remapping(x2apic_supported());
-	if (ret)
-		goto end_restore;
+	ret = enable_IR();
+	if (!ret) {
+		/* IR is required if x2apic is enabled by BIOS even
+		 * when running in kvm since this indicates present
+		 * of APIC ID > 255 */
+	       	if (x2apic_preenabled || !kvm_para_available())
+			goto nox2apic;
+		else
+			/* without IR all CPUs can be addressed by IOAPIC/MSI
+			 * only in physical mode */
+			x2apic_phys = 1;
+	}
 
-	pr_info("Enabled Interrupt-remapping\n");
+	x2apic_enabled = 1;
 
 	if (x2apic_supported() && !x2apic_mode) {
 		x2apic_mode = 1;
@@ -1416,41 +1441,28 @@  void __init enable_IR_x2apic(void)
 		pr_info("Enabled x2apic\n");
 	}
 
-end_restore:
-	if (ret)
-		/*
-		 * IR enabling failed
-		 */
+nox2apic:
+	if (!ret) /* IR enabling failed */
 		restore_IO_APIC_setup(ioapic_entries);
-
 	unmask_8259A();
 	local_irq_restore(flags);
 
-end:
-	if (ioapic_entries)
-		free_ioapic_entries(ioapic_entries);
+	free_ioapic_entries(ioapic_entries);
 
-	if (!ret)
+	if (x2apic_enabled)
 		return;
 
-ir_failed:
-	if (x2apic_preenabled)
+	if (x2apic_preenabled) {
+#ifdef CONFIG_INTR_REMAP
 		panic("x2apic enabled by bios. But IR enabling failed");
-	else if (cpu_has_x2apic)
-		pr_info("Not enabling x2apic,Intr-remapping\n");
 #else
-	if (!cpu_has_x2apic)
-		return;
-
-	if (x2apic_preenabled)
 		panic("x2apic enabled prior OS handover,"
-		      " enable CONFIG_X86_X2APIC, CONFIG_INTR_REMAP");
+				" enable CONFIG_X86_X2APIC, CONFIG_INTR_REMAP");
 #endif
-
-	return;
+	} else if (cpu_has_x2apic)
+		pr_info("Not enabling x2apic,Intr-remapping\n");
 }
 
-
 #ifdef CONFIG_X86_64
 /*
  * Detect and enable local APICs on non-SMP boards.