
[v2,7/8] KVM: x86/xen: avoid blocking in hardirq context in kvm_xen_set_evtchn_fast()

Message ID 20240227115648.3104-8-dwmw2@infradead.org (mailing list archive)
State New, archived
Series KVM: x86/xen updates

Commit Message

David Woodhouse Feb. 27, 2024, 11:49 a.m. UTC
From: Paul Durrant <pdurrant@amazon.com>

As described in [1], compiling with CONFIG_PROVE_RAW_LOCK_NESTING shows that
kvm_xen_set_evtchn_fast() is blocking on pfncache locks in IRQ context.
It is only actually blocking with PREEMPT_RT because the locks will be
turned into mutexes. There is no 'raw' version of rwlock_t that can be used
to avoid that, so use read_trylock() and treat failure to lock the same as
an invalid cache.

[1] https://lore.kernel.org/lkml/99771ef3a4966a01fefd3adbb2ba9c3a75f97cf2.camel@infradead.org/T/#mbd06e5a04534ce9c0ee94bd8f1e8d942b2d45bd6
Fixes: 77c9b9dea4fb ("KVM: x86/xen: Use fast path for Xen timer delivery")
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: x86@kernel.org

v2:
 • Use read_trylock only in interrupt context, to avoid concerns about
   unfairness in the slow path.
---
 arch/x86/kvm/xen.c | 41 +++++++++++++++++++++++++++++++----------
 1 file changed, 31 insertions(+), 10 deletions(-)
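
Condensed from the full diff at the bottom of this page, the fast path in
set_shinfo_evtchn_pending() ends up looking like this after the patch:

	local_irq_save(flags);
	if (!read_trylock(&gpc->lock)) {
		/*
		 * Timer events may be delivered in hardirq context even under
		 * PREEMPT_RT (HRTIMER_MODE_ABS_HARD), so bail to the slow path
		 * if the trylock fails in interrupt context.
		 */
		if (in_interrupt())
			goto out;

		read_lock(&gpc->lock);
	}

	if (!kvm_gpc_check(gpc, PAGE_SIZE))
		goto out_unlock;

	/* ... set the event channel pending bit via gpc->khva ... */

 out_unlock:
	read_unlock(&gpc->lock);
 out:
	local_irq_restore(flags);
	return rc;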

Comments

Steven Rostedt Feb. 27, 2024, 2:43 p.m. UTC | #1
On Tue, 27 Feb 2024 11:49:21 +0000
David Woodhouse <dwmw2@infradead.org> wrote:

> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index c16b6d394d55..d8b5326ecebc 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -1736,9 +1736,23 @@ static int set_shinfo_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
>  	unsigned long flags;
>  	int rc = -EWOULDBLOCK;
>  
> -	read_lock_irqsave(&gpc->lock, flags);
> +	local_irq_save(flags);
> +	if (!read_trylock(&gpc->lock)) {

Note, directly disabling interrupts in PREEMPT_RT is just as bad as doing
so in non-RT and then taking a mutex.

Worse yet, in PREEMPT_RT read_unlock_irqrestore() isn't going to re-enable
interrupts.

Not really sure what the best solution is here.

-- Steve


> +		/*
> +		 * When PREEMPT_RT turns locks into mutexes, rwlocks are
> +		 * turned into mutexes and most interrupts are threaded.
> +		 * But timer events may be delivered in hardirq mode due
> +		 * to using HRTIMER_MODE_ABS_HARD. So bail to the slow
> +		 * path if the trylock fails in interrupt context.
> +		 */
> +		if (in_interrupt())
> +			goto out;
> +
> +		read_lock(&gpc->lock);
> +	}
> +
>  	if (!kvm_gpc_check(gpc, PAGE_SIZE))
> -		goto out;
> +		goto out_unlock;
>  
>  	if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
>  		struct shared_info *shinfo = gpc->khva;
David Woodhouse Feb. 27, 2024, 5 p.m. UTC | #2
On 27 February 2024 14:43:26 GMT, Steven Rostedt <rostedt@goodmis.org> wrote:
>On Tue, 27 Feb 2024 11:49:21 +0000
>David Woodhouse <dwmw2@infradead.org> wrote:
>
>> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
>> index c16b6d394d55..d8b5326ecebc 100644
>> --- a/arch/x86/kvm/xen.c
>> +++ b/arch/x86/kvm/xen.c
>> @@ -1736,9 +1736,23 @@ static int set_shinfo_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
>>  	unsigned long flags;
>>  	int rc = -EWOULDBLOCK;
>>  
>> -	read_lock_irqsave(&gpc->lock, flags);
>> +	local_irq_save(flags);
>> +	if (!read_trylock(&gpc->lock)) {
>
>Note, directly disabling interrupts in PREEMPT_RT is just as bad as doing
>so in non-RT and then taking a mutex.

Well, it *isn't* a mutex in non-RT. It's basically a spinlock. And even though RT turns it into a mutex this is only a trylock with a fall back to a slow path in process context, so it ends up being a very short period of irqdisable/lock – just to validate the target address of the cache, and write there. (Or fall back to that slow path, if the cache needs revalidating).

So I think this is probably OK, but...


>Worse yet, in PREEMPT_RT read_unlock_irqrestore() isn't going to re-enable
>interrupts.

... that's kind of odd. I think there's code elsewhere which assumes it will, and pairs it with an explicit local_irq_save() like the above, in a different atomic context (updating runstate time info when being descheduled). So *that* needs changing for RT?
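
(For reference, the pairing in the runstate update being referred to here
looks like this, paraphrased from the lines that the patches later in this
thread remove:)

	if (atomic) {
		/* When invoked from kvm_sched_out() we cannot sleep */
		local_irq_save(flags);
		if (!read_trylock(&gpc1->lock)) {
			local_irq_restore(flags);
			return;
		}
	} else {
		read_lock_irqsave(&gpc1->lock, flags);
	}

	/* ... update the runstate area(s) ... */

	/*
	 * Assumes this restores the IRQ state saved above; as noted, that is
	 * not what happens on PREEMPT_RT.
	 */
	read_unlock_irqrestore(&gpc1->lock, flags);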
Thomas Gleixner Feb. 27, 2024, 5:18 p.m. UTC | #3
On Tue, Feb 27 2024 at 17:00, David Woodhouse wrote:
> On 27 February 2024 14:43:26 GMT, Steven Rostedt <rostedt@goodmis.org> wrote:
>>On Tue, 27 Feb 2024 11:49:21 +0000
>>David Woodhouse <dwmw2@infradead.org> wrote:
>>
>>> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
>>> index c16b6d394d55..d8b5326ecebc 100644
>>> --- a/arch/x86/kvm/xen.c
>>> +++ b/arch/x86/kvm/xen.c
>>> @@ -1736,9 +1736,23 @@ static int set_shinfo_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
>>>  	unsigned long flags;
>>>  	int rc = -EWOULDBLOCK;
>>>  
>>> -	read_lock_irqsave(&gpc->lock, flags);
>>> +	local_irq_save(flags);
>>> +	if (!read_trylock(&gpc->lock)) {
>>
>>Note, directly disabling interrupts in PREEMPT_RT is just as bad as doing
>>so in non-RT and then taking a mutex.
>
> Well, it *isn't* a mutex in non-RT. It's basically a spinlock. And
> even though RT turns it into a mutex this is only a trylock with a
> fall back to a slow path in process context, so it ends up being a
> very short period of irqdisable/lock – just to validate the target
> address of the cache, and write there. (Or fall back to that slow
> path, if the cache needs revalidating).
>
> So I think this is probably OK, but...
>
>>Worse yet, in PREEMPT_RT read_unlock_irqrestore() isn't going to re-enable
>>interrupts.
>
> ... that's kind of odd. I think there's code elsewhere which assumes
> it will, and pairs it with an explicit local_irq_save() like the
> above, in a different atomic context (updating runstate time info when
> being descheduled).

While it is legit, it simply cannot work for RT. We fixed the few places
which had it open coded (mostly for no good reason).

> So *that* needs changing for RT?

No. It's not required anywhere and we really don't change that just to
accommodate the xen shim.

I'll look at the problem at hand tomorrow.

Thanks,

        tglx
David Woodhouse Feb. 27, 2024, 5:22 p.m. UTC | #4
On 27 February 2024 17:18:58 GMT, Thomas Gleixner <tglx@linutronix.de> wrote:
>On Tue, Feb 27 2024 at 17:00, David Woodhouse wrote:
>> On 27 February 2024 14:43:26 GMT, Steven Rostedt <rostedt@goodmis.org> wrote:
>>>On Tue, 27 Feb 2024 11:49:21 +0000
>>>David Woodhouse <dwmw2@infradead.org> wrote:
>>>
>>>> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
>>>> index c16b6d394d55..d8b5326ecebc 100644
>>>> --- a/arch/x86/kvm/xen.c
>>>> +++ b/arch/x86/kvm/xen.c
>>>> @@ -1736,9 +1736,23 @@ static int set_shinfo_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
>>>>  	unsigned long flags;
>>>>  	int rc = -EWOULDBLOCK;
>>>>  
>>>> -	read_lock_irqsave(&gpc->lock, flags);
>>>> +	local_irq_save(flags);
>>>> +	if (!read_trylock(&gpc->lock)) {
>>>
>>>Note, directly disabling interrupts in PREEMPT_RT is just as bad as doing
>>>so in non-RT and then taking a mutex.
>>
>> Well, it *isn't* a mutex in non-RT. It's basically a spinlock. And
>> even though RT turns it into a mutex this is only a trylock with a
>> fall back to a slow path in process context, so it ends up being a
>> very short period of irqdisable/lock – just to validate the target
>> address of the cache, and write there. (Or fall back to that slow
>> path, if the cache needs revalidating).
>>
>> So I think this is probably OK, but...
>>
>>>Worse yet, in PREEMPT_RT read_unlock_irqrestore() isn't going to re-enable
>>>interrupts.
>>
>> ... that's kind of odd. I think there's code elsewhere which assumes
>> it will, and pairs it with an explicit local_irq_save() like the
>> above, in a different atomic context (updating runstate time info when
>> being descheduled).
>
>While it is legit, it simply cannot work for RT. We fixed the few places
>which had it open coded (mostly for no good reason).
>
>> So *that* needs changing for RT?
>
>No. It's not required anywhere and we really don't change that just to
>accommodate the xen shim.

It's open-coded in the runstate update in the Xen shim, as I said. That's what needs changing, which is separate from the problem at hand in this patch.

>I'll look at the problem at hand tomorrow.

Ta.
Steven Rostedt Feb. 27, 2024, 5:25 p.m. UTC | #5
On Tue, 27 Feb 2024 17:00:42 +0000
David Woodhouse <dwmw2@infradead.org> wrote:

> On 27 February 2024 14:43:26 GMT, Steven Rostedt <rostedt@goodmis.org> wrote:
> >On Tue, 27 Feb 2024 11:49:21 +0000
> >David Woodhouse <dwmw2@infradead.org> wrote:
> >  
> >> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> >> index c16b6d394d55..d8b5326ecebc 100644
> >> --- a/arch/x86/kvm/xen.c
> >> +++ b/arch/x86/kvm/xen.c
> >> @@ -1736,9 +1736,23 @@ static int set_shinfo_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
> >>  	unsigned long flags;
> >>  	int rc = -EWOULDBLOCK;
> >>  
> >> -	read_lock_irqsave(&gpc->lock, flags);
> >> +	local_irq_save(flags);
> >> +	if (!read_trylock(&gpc->lock)) {  
> >
> >Note, directly disabling interrupts in PREEMPT_RT is just as bad as doing
> >so in non-RT and then taking a mutex.
> 
> Well, it *isn't* a mutex in non-RT. It's basically a spinlock. And even
> though RT turns it into a mutex this is only a trylock with a fall back
> to a slow path in process context, so it ends up being a very short
> period of irqdisable/lock – just to validate the target address of the
> cache, and write there. (Or fall back to that slow path, if the cache
> needs revalidating).

Yes, but if the trylock fails, you then call:

+	if (!read_trylock(&gpc->lock)) {
+		/*
+		 * When PREEMPT_RT turns locks into mutexes, rwlocks are
+		 * turned into mutexes and most interrupts are threaded.
+		 * But timer events may be delivered in hardirq mode due
+		 * to using HRTIMER_MODE_ABS_HARD. So bail to the slow
+		 * path if the trylock fails in interrupt context.
+		 */
+		if (in_interrupt())
+			goto out;
+
+		read_lock(&gpc->lock);

That takes the read_lock() with interrupts still disabled. On RT, that will
likely schedule as it had just failed a trylock.

+	}
+
> 
> So I think this is probably OK, but...
> 
> 
> >Worse yet, in PREEMPT_RT read_unlock_irqrestore() isn't going to
> >re-enable interrupts.  
> 
> ... that's kind of odd. I think there's code elsewhere which assumes it
> will, and pairs it with an explicit local_irq_save() like the above, in a
> different atomic context (updating runstate time info when being
> descheduled). So *that* needs changing for RT?


I see Thomas replied. I'll leave the rest to him.

-- Steve
David Woodhouse Feb. 27, 2024, 5:28 p.m. UTC | #6
On 27 February 2024 17:25:52 GMT, Steven Rostedt <rostedt@goodmis.org> wrote:
>On Tue, 27 Feb 2024 17:00:42 +0000
>David Woodhouse <dwmw2@infradead.org> wrote:
>
>> On 27 February 2024 14:43:26 GMT, Steven Rostedt <rostedt@goodmis.org> wrote:
>> >On Tue, 27 Feb 2024 11:49:21 +0000
>> >David Woodhouse <dwmw2@infradead.org> wrote:
>> >  
>> >> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
>> >> index c16b6d394d55..d8b5326ecebc 100644
>> >> --- a/arch/x86/kvm/xen.c
>> >> +++ b/arch/x86/kvm/xen.c
>> >> @@ -1736,9 +1736,23 @@ static int set_shinfo_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
>> >>  	unsigned long flags;
>> >>  	int rc = -EWOULDBLOCK;
>> >>  
>> >> -	read_lock_irqsave(&gpc->lock, flags);
>> >> +	local_irq_save(flags);
>> >> +	if (!read_trylock(&gpc->lock)) {  
>> >
>> >Note, directly disabling interrupts in PREEMPT_RT is just as bad as doing
>> >so in non-RT and then taking a mutex.
>> 
>> Well, it *isn't* a mutex in non-RT. It's basically a spinlock. And even
>> though RT turns it into a mutex this is only a trylock with a fall back
>> to a slow path in process context, so it ends up being a very short
>> period of irqdisable/lock – just to validate the target address of the
>> cache, and write there. (Or fall back to that slow path, if the cache
>> needs revalidating).
>
>Yes, but if the trylock fails, you then call:
>
>+	if (!read_trylock(&gpc->lock)) {
>+		/*
>+		 * When PREEMPT_RT turns locks into mutexes, rwlocks are
>+		 * turned into mutexes and most interrupts are threaded.
>+		 * But timer events may be delivered in hardirq mode due
>+		 * to using HRTIMER_MODE_ABS_HARD. So bail to the slow
>+		 * path if the trylock fails in interrupt context.
>+		 */
>+		if (in_interrupt())
>+			goto out;
>+
>+		read_lock(&gpc->lock);
>
>That takes the read_lock() with interrupts still disabled. On RT, that will
>likely schedule as it had just failed a trylock.

Oh... pants. Right, so it really does need to use read_trylock_irqsave() in the non-IRQ case, *and* it needs to either read_unlock_irqrestore() or separately unlock and local_irq_restore(), according to which way it did the locking in the first place. Ick.
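
(A sketch of the pairing described here, purely illustrative and not from any
posted patch: it assumes the read_trylock_irqsave() helper posted in the next
message, the 'irqsave' flag and exact control flow are invented for the
illustration, and the whole approach is dropped again two messages further on.)

	unsigned long flags;
	bool irqsave = !in_interrupt();

	if (irqsave) {
		if (!read_trylock_irqsave(&gpc->lock, &flags))
			goto out;		/* fall back to the slow path */
	} else if (!read_trylock(&gpc->lock)) {
		goto out;			/* hardirq: defer to the slow path */
	}

	/* ... fast path: check the cache and write via gpc->khva ... */

	if (irqsave)
		read_unlock_irqrestore(&gpc->lock, flags);
	else
		read_unlock(&gpc->lock);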
David Woodhouse March 7, 2024, 9:27 a.m. UTC | #7
On Tue, 2024-02-27 at 17:22 +0000, David Woodhouse wrote:
> 
> > > > Worse yet, in PREEMPT_RT read_unlock_irqrestore() isn't going to re-enable
> > > > interrupts.
> > > 
> > > ... that's kind of odd. I think there's code elsewhere which assumes
> > > it will, and pairs it with an explicit local_irq_save() like the
> > > above, in a different atomic context (updating runstate time info when
> > > being descheduled).
> > 
> > While it is legit, it simply cannot work for RT. We fixed the few places
> > which had it open coded (mostly for no good reason).
> > 
> > > So *that* needs changing for RT?
> > 
> > No. It's not required anywhere and we really don't change that just to
> > accomodate the xen shim.
> 
> It's open-coded in the runstate update in the Xen shim, as I said.
> That's what needs changing, which is separate from the problem at
> hand in this patch.
> 
> > I'll look at the problem at hand tomorrow.
> 
> Ta.

How about this? It's fugly as hell, it puts PREEMPT_RT knowledge into
the code to make the local trylock function *conditionally* disable
interrupts. I hate it, but let's start with: is it even *correct*?

This is addressing the existing runstate code I mentioned above. If it
works then it could theoretically be used in the patch in $SUBJECT too,
instead of the open-coded irqsave and trylock.

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index efca40cc8c4f..78e23f151065 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -270,6 +270,26 @@ static void kvm_xen_init_timer(struct kvm_vcpu *vcpu)
 	vcpu->arch.xen.timer.function = xen_timer_callback;
 }
 
+/*
+ * With CONFIG_PREEMPT_RT, 'irqsave' locking doesn't really disable IRQs as
+ * "all IRQs are threaded" (except the hrtimer IRQs). So, open-coding a
+ * local_irq_save() before a read_trylock() is wrong for PREEMPT_RT.
+ */
+static inline bool read_trylock_irqsave(rwlock_t *lck, unsigned long *flags)
+{
+#ifndef CONFIG_PREEMPT_RT
+	local_irq_save(*flags);
+#endif
+	if (!read_trylock(lck)) {
+#ifndef CONFIG_PREEMPT_RT
+		local_irq_restore(*flags);
+#endif
+		return false;
+	}
+
+	return true;
+}
+
 static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 {
 	struct kvm_vcpu_xen *vx = &v->arch.xen;
@@ -373,11 +393,8 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 	 * gfn_to_pfn caches that cover the region.
 	 */
 	if (atomic) {
-		local_irq_save(flags);
-		if (!read_trylock(&gpc1->lock)) {
-			local_irq_restore(flags);
+		if (!read_trylock_irqsave(&gpc1->lock, &flags))
 			return;
-		}
 	} else {
 		read_lock_irqsave(&gpc1->lock, flags);
 	}
David Woodhouse March 7, 2024, 10:02 a.m. UTC | #8
On Thu, 2024-03-07 at 09:27 +0000, David Woodhouse wrote:
> 
> How about this? It's fugly as hell, it puts PREEMPT_RT knowledge into
> the code to make the local trylock function *conditionally* disable
> interrupts. I hate it, but let's start with: is it even *correct*?

Oh but wait...

The only reason we use read_lock_irqsave() in the first place is
because there was code which runs in interrupt context which takes the
same lock.

But Paul's patch is *changing* that, so the in-interrupt code will now
use read_trylock() instead, and can't deadlock. It always has to have
a fallback to a slow path anyway, for when the cache isn't valid.

So I think we can just drop the irqsave from *all* use of the locks in
question. More coffee required...
David Woodhouse March 8, 2024, 6:10 p.m. UTC | #9
On Tue, 2024-02-27 at 09:43 -0500, Steven Rostedt wrote:
> 
> > --- a/arch/x86/kvm/xen.c
> > +++ b/arch/x86/kvm/xen.c
> > @@ -1736,9 +1736,23 @@ static int set_shinfo_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
> >         unsigned long flags;
> >         int rc = -EWOULDBLOCK;
> >   
> > -       read_lock_irqsave(&gpc->lock, flags);
> > +       local_irq_save(flags);
> > +       if (!read_trylock(&gpc->lock)) {
> 
> Note, directly disabling interrupts in PREEMPT_RT is just as bad as doing
> so in non-RT and then taking a mutex.
> 
> Worse yet, in PREEMPT_RT read_unlock_irqrestore() isn't going to re-enable
> interrupts.
> 
> Not really sure what the best solution is here.

I think the best solution is to realise that, now that Paul has made the
in-interrupt code paths use read_trylock(), we don't even need to use
read_lock_irqsave() anywhere anyway, and then we don't have to *care*
that PREEMPT_RT does it differently. Going to do some testing with this
version...

From: David Woodhouse <dwmw@amazon.co.uk>
Subject: [PATCH] KVM: x86/xen: avoid blocking in hardirq context in
 kvm_xen_set_evtchn_fast()

As described in [1], compiling with CONFIG_PROVE_RAW_LOCK_NESTING shows that
kvm_xen_set_evtchn_fast() is blocking on pfncache locks in IRQ context.
It is only actually blocking with PREEMPT_RT because the locks will be
turned into mutexes. There is no 'raw' version of rwlock_t that can be used
to avoid that, so use read_trylock() and treat failure to lock the same as
an invalid cache.

However, with PREEMPT_RT read_lock_irqsave() doesn't actually disable IRQs
either. So open-coding a trylock_irqsave() using local_irq_save() would be
wrong in the PREEMPT_RT case.

In fact, if kvm_xen_set_evtchn_fast() is now going to use a trylock anyway
when invoked in hardirq context, there's actually no *need* for interrupts
to be disabled by any other code path that takes the lock. So just switch
all other users to a plain read_lock() and then the different semantics
on PREEMPT_RT are not an issue.

Add a warning in __kvm_xen_has_interrupt() just to be sure that it is never
called from interrupt context.

[1] https://lore.kernel.org/lkml/99771ef3a4966a01fefd3adbb2ba9c3a75f97cf2.camel@infradead.org/T/#mbd06e5a04534ce9c0ee94bd8f1e8d942b2d45bd6
Fixes: 77c9b9dea4fb ("KVM: x86/xen: Use fast path for Xen timer delivery")
Fixes: bbe17c625d68 ("KVM: x86/xen: Fix potential deadlock in kvm_xen_update_runstate_guest()")
Co-developed-by: Paul Durrant <pdurrant@amazon.com>
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/xen.c | 80 ++++++++++++++++++++++++----------------------
 1 file changed, 41 insertions(+), 39 deletions(-)

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index f8113eb8d040..ab3bbe75602d 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -45,7 +45,7 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
 	int ret = 0;
 	int idx = srcu_read_lock(&kvm->srcu);
 
-	read_lock_irq(&gpc->lock);
+	read_lock(&gpc->lock);
 	while (!kvm_gpc_check(gpc, PAGE_SIZE)) {
 		read_unlock_irq(&gpc->lock);
 
@@ -53,7 +53,7 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
 		if (ret)
 			goto out;
 
-		read_lock_irq(&gpc->lock);
+		read_lock(&gpc->lock);
 	}
 
 	/*
@@ -96,7 +96,7 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
 	smp_wmb();
 
 	wc->version = wc_version + 1;
-	read_unlock_irq(&gpc->lock);
+	read_unlock(&gpc->lock);
 
 	kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE);
 
@@ -277,7 +277,6 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 	struct gfn_to_pfn_cache *gpc2 = &vx->runstate2_cache;
 	size_t user_len, user_len1, user_len2;
 	struct vcpu_runstate_info rs;
-	unsigned long flags;
 	size_t times_ofs;
 	uint8_t *update_bit = NULL;
 	uint64_t entry_time;
@@ -373,16 +372,13 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 	 * gfn_to_pfn caches that cover the region.
 	 */
 	if (atomic) {
-		local_irq_save(flags);
-		if (!read_trylock(&gpc1->lock)) {
-			local_irq_restore(flags);
+		if (!read_trylock(&gpc1->lock))
 			return;
-		}
 	} else {
-		read_lock_irqsave(&gpc1->lock, flags);
+		read_lock(&gpc1->lock);
 	}
 	while (!kvm_gpc_check(gpc1, user_len1)) {
-		read_unlock_irqrestore(&gpc1->lock, flags);
+		read_unlock(&gpc1->lock);
 
 		/* When invoked from kvm_sched_out() we cannot sleep */
 		if (atomic)
@@ -391,7 +387,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 		if (kvm_gpc_refresh(gpc1, user_len1))
 			return;
 
-		read_lock_irqsave(&gpc1->lock, flags);
+		read_lock(&gpc1->lock);
 	}
 
 	if (likely(!user_len2)) {
@@ -419,7 +415,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 		lock_set_subclass(&gpc1->lock.dep_map, 1, _THIS_IP_);
 		if (atomic) {
 			if (!read_trylock(&gpc2->lock)) {
-				read_unlock_irqrestore(&gpc1->lock, flags);
+				read_unlock(&gpc1->lock);
 				return;
 			}
 		} else {
@@ -428,7 +424,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 
 		if (!kvm_gpc_check(gpc2, user_len2)) {
 			read_unlock(&gpc2->lock);
-			read_unlock_irqrestore(&gpc1->lock, flags);
+			read_unlock(&gpc1->lock);
 
 			/* When invoked from kvm_sched_out() we cannot sleep */
 			if (atomic)
@@ -533,7 +529,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 	}
 
 	kvm_gpc_mark_dirty_in_slot(gpc1);
-	read_unlock_irqrestore(&gpc1->lock, flags);
+	read_unlock(&gpc1->lock);
 }
 
 void kvm_xen_update_runstate(struct kvm_vcpu *v, int state)
@@ -592,7 +588,6 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
 {
 	unsigned long evtchn_pending_sel = READ_ONCE(v->arch.xen.evtchn_pending_sel);
 	struct gfn_to_pfn_cache *gpc = &v->arch.xen.vcpu_info_cache;
-	unsigned long flags;
 
 	if (!evtchn_pending_sel)
 		return;
@@ -602,14 +597,14 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
 	 * does anyway. Page it in and retry the instruction. We're just a
 	 * little more honest about it.
 	 */
-	read_lock_irqsave(&gpc->lock, flags);
+	read_lock(&gpc->lock);
 	while (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
-		read_unlock_irqrestore(&gpc->lock, flags);
+		read_unlock(&gpc->lock);
 
 		if (kvm_gpc_refresh(gpc, sizeof(struct vcpu_info)))
 			return;
 
-		read_lock_irqsave(&gpc->lock, flags);
+		read_lock(&gpc->lock);
 	}
 
 	/* Now gpc->khva is a valid kernel address for the vcpu_info */
@@ -639,7 +634,7 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
 	}
 
 	kvm_gpc_mark_dirty_in_slot(gpc);
-	read_unlock_irqrestore(&gpc->lock, flags);
+	read_unlock(&gpc->lock);
 
 	/* For the per-vCPU lapic vector, deliver it as MSI. */
 	if (v->arch.xen.upcall_vector)
@@ -649,7 +644,6 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
 int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
 {
 	struct gfn_to_pfn_cache *gpc = &v->arch.xen.vcpu_info_cache;
-	unsigned long flags;
 	u8 rc = 0;
 
 	/*
@@ -665,9 +659,10 @@ int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
 	BUILD_BUG_ON(sizeof(rc) !=
 		     sizeof_field(struct compat_vcpu_info, evtchn_upcall_pending));
 
-	read_lock_irqsave(&gpc->lock, flags);
+	WARN_ON_ONCE(in_interrupt());
+	read_lock(&gpc->lock);
 	while (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
-		read_unlock_irqrestore(&gpc->lock, flags);
+		read_unlock(&gpc->lock);
 
 		/*
 		 * This function gets called from kvm_vcpu_block() after setting the
@@ -687,11 +682,11 @@ int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
 			 */
 			return 0;
 		}
-		read_lock_irqsave(&gpc->lock, flags);
+		read_lock(&gpc->lock);
 	}
 
 	rc = ((struct vcpu_info *)gpc->khva)->evtchn_upcall_pending;
-	read_unlock_irqrestore(&gpc->lock, flags);
+	read_unlock(&gpc->lock);
 	return rc;
 }
 
@@ -1382,12 +1377,11 @@ static bool wait_pending_event(struct kvm_vcpu *vcpu, int nr_ports,
 	struct kvm *kvm = vcpu->kvm;
 	struct gfn_to_pfn_cache *gpc = &kvm->arch.xen.shinfo_cache;
 	unsigned long *pending_bits;
-	unsigned long flags;
 	bool ret = true;
 	int idx, i;
 
 	idx = srcu_read_lock(&kvm->srcu);
-	read_lock_irqsave(&gpc->lock, flags);
+	read_lock(&gpc->lock);
 	if (!kvm_gpc_check(gpc, PAGE_SIZE))
 		goto out_rcu;
 
@@ -1408,7 +1402,7 @@ static bool wait_pending_event(struct kvm_vcpu *vcpu, int nr_ports,
 	}
 
  out_rcu:
-	read_unlock_irqrestore(&gpc->lock, flags);
+	read_unlock(&gpc->lock);
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	return ret;
@@ -1730,12 +1724,17 @@ static int set_shinfo_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
 	struct kvm *kvm = vcpu->kvm;
 	struct gfn_to_pfn_cache *gpc = &kvm->arch.xen.shinfo_cache;
 	unsigned long *pending_bits, *mask_bits;
-	unsigned long flags;
 	int rc = -EWOULDBLOCK;
 
-	read_lock_irqsave(&gpc->lock, flags);
+	if (in_interrupt()) {
+		if (!read_trylock(&gpc->lock))
+			goto out;
+	} else {
+		read_lock(&gpc->lock);
+	}
+
 	if (!kvm_gpc_check(gpc, PAGE_SIZE))
-		goto out;
+		goto out_unlock;
 
 	if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
 		struct shared_info *shinfo = gpc->khva;
@@ -1758,8 +1757,9 @@ static int set_shinfo_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
 		rc = 1; /* It is newly raised */
 	}
 
+ out_unlock:
+	read_unlock(&gpc->lock);
  out:
-	read_unlock_irqrestore(&gpc->lock, flags);
 	return rc;
 }
 
@@ -1767,23 +1767,23 @@ static bool set_vcpu_info_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
 {
 	struct kvm *kvm = vcpu->kvm;
 	struct gfn_to_pfn_cache *gpc = &vcpu->arch.xen.vcpu_info_cache;
-	unsigned long flags;
 	bool kick_vcpu = false;
+	bool locked;
 
-	read_lock_irqsave(&gpc->lock, flags);
+	locked = read_trylock(&gpc->lock);
 
 	/*
 	 * Try to deliver the event directly to the vcpu_info. If successful and
 	 * the guest is using upcall_vector delivery, send the MSI.
-	 * If the pfncache is invalid, set the shadow. In this case, or if the
-	 * guest is using another form of event delivery, the vCPU must be
-	 * kicked to complete the delivery.
+	 * If the pfncache lock is contended or the cache is invalid, set the
+	 * shadow. In this case, or if the guest is using another form of event
+	 * delivery, the vCPU must be kicked to complete the delivery.
 	 */
 	if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
 		struct vcpu_info *vcpu_info = gpc->khva;
 		int port_word_bit = port / 64;
 
-		if (!kvm_gpc_check(gpc, sizeof(*vcpu_info))) {
+		if ((!locked || !kvm_gpc_check(gpc, sizeof(*vcpu_info)))) {
 			if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
 				kick_vcpu = true;
 			goto out;
@@ -1797,7 +1797,7 @@ static bool set_vcpu_info_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
 		struct compat_vcpu_info *vcpu_info = gpc->khva;
 		int port_word_bit = port / 32;
 
-		if (!kvm_gpc_check(gpc, sizeof(*vcpu_info))) {
+		if ((!locked || !kvm_gpc_check(gpc, sizeof(*vcpu_info)))) {
 			if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
 				kick_vcpu = true;
 			goto out;
@@ -1816,7 +1816,9 @@ static bool set_vcpu_info_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
 	}
 
  out:
-	read_unlock_irqrestore(&gpc->lock, flags);
+	if (locked)
+		read_unlock(&gpc->lock);
+
 	return kick_vcpu;
 }

Patch

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index c16b6d394d55..d8b5326ecebc 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -1736,9 +1736,23 @@  static int set_shinfo_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
 	unsigned long flags;
 	int rc = -EWOULDBLOCK;
 
-	read_lock_irqsave(&gpc->lock, flags);
+	local_irq_save(flags);
+	if (!read_trylock(&gpc->lock)) {
+		/*
+		 * When PREEMPT_RT turns locks into mutexes, rwlocks are
+		 * turned into mutexes and most interrupts are threaded.
+		 * But timer events may be delivered in hardirq mode due
+		 * to using HRTIMER_MODE_ABS_HARD. So bail to the slow
+		 * path if the trylock fails in interrupt context.
+		 */
+		if (in_interrupt())
+			goto out;
+
+		read_lock(&gpc->lock);
+	}
+
 	if (!kvm_gpc_check(gpc, PAGE_SIZE))
-		goto out;
+		goto out_unlock;
 
 	if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
 		struct shared_info *shinfo = gpc->khva;
@@ -1761,8 +1775,10 @@  static int set_shinfo_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
 		rc = 1; /* It is newly raised */
 	}
 
+ out_unlock:
+	read_unlock(&gpc->lock);
  out:
-	read_unlock_irqrestore(&gpc->lock, flags);
+	local_irq_restore(flags);
 	return rc;
 }
 
@@ -1772,21 +1788,23 @@  static bool set_vcpu_info_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
 	struct gfn_to_pfn_cache *gpc = &vcpu->arch.xen.vcpu_info_cache;
 	unsigned long flags;
 	bool kick_vcpu = false;
+	bool locked;
 
-	read_lock_irqsave(&gpc->lock, flags);
+	local_irq_save(flags);
+	locked = read_trylock(&gpc->lock);
 
 	/*
 	 * Try to deliver the event directly to the vcpu_info. If successful and
 	 * the guest is using upcall_vector delivery, send the MSI.
-	 * If the pfncache is invalid, set the shadow. In this case, or if the
-	 * guest is using another form of event delivery, the vCPU must be
-	 * kicked to complete the delivery.
+	 * If the pfncache lock is contended or the cache is invalid, set the
+	 * shadow. In this case, or if the guest is using another form of event
+	 * delivery, the vCPU must be kicked to complete the delivery.
 	 */
 	if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
 		struct vcpu_info *vcpu_info = gpc->khva;
 		int port_word_bit = port / 64;
 
-		if (!kvm_gpc_check(gpc, sizeof(*vcpu_info))) {
+		if ((!locked || !kvm_gpc_check(gpc, sizeof(*vcpu_info)))) {
 			if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
 				kick_vcpu = true;
 			goto out;
@@ -1800,7 +1818,7 @@  static bool set_vcpu_info_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
 		struct compat_vcpu_info *vcpu_info = gpc->khva;
 		int port_word_bit = port / 32;
 
-		if (!kvm_gpc_check(gpc, sizeof(*vcpu_info))) {
+		if ((!locked || !kvm_gpc_check(gpc, sizeof(*vcpu_info)))) {
 			if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
 				kick_vcpu = true;
 			goto out;
@@ -1819,7 +1837,10 @@  static bool set_vcpu_info_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
 	}
 
  out:
-	read_unlock_irqrestore(&gpc->lock, flags);
+	if (locked)
+		read_unlock(&gpc->lock);
+
+	local_irq_restore(flags);
 	return kick_vcpu;
 }