diff mbox series

[v1,4/4] s390/kvm: VSIE: correctly handle MVPG when in VSIE

Message ID 20201218141811.310267-5-imbrenda@linux.ibm.com (mailing list archive)
State New, archived
Headers show
Series s390/kvm: fix MVPG when in VSIE | expand

Commit Message

Claudio Imbrenda Dec. 18, 2020, 2:18 p.m. UTC
Correctly handle the MVPG instruction when issued by a VSIE guest.

Fixes: a3508fbe9dc6d ("KVM: s390: vsie: initial support for nested virtualization")
Cc: stable@vger.kernel.org
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 arch/s390/kvm/vsie.c | 73 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

Comments

David Hildenbrand Dec. 20, 2020, 10:13 a.m. UTC | #1
On 18.12.20 15:18, Claudio Imbrenda wrote:
> Correctly handle the MVPG instruction when issued by a VSIE guest.
> 

I remember that MVPG SIE documentation was completely crazy and full of
corner cases. :)

Looking at arch/s390/kvm/intercept.c:handle_mvpg_pei(), I can spot that

1. "This interception can only happen for guests with DAT disabled ..."
2. KVM does not make use of any mvpg state inside the SCB.

Can this be observed with Linux guests?


Can I get some information on what information is stored at [0xc0, 0xd)
inside the SCB? I assume it's:

0xc0: guest physical address of source PTE
0xc8: guest physical address of target PTE


Also, which conditions have to be met such that we get a ICPT_PARTEXEC:

a) State of guest DAT (I assume off?)
b) State of PTEs: What happens if there is no PTE (I assume we need two
PTEs, otherwise no such intercept)? I assume we get an intercept if one
of both PTEs is not present or the destination PTE is protected. Correct?

So, when we (g1) get an intercept for g3, can we be sure 0xc0 and 0xc8
in the scb are both valid g1 addresses pointing at our PTE, and what do
we know about these PTEs (one not present or destination protected)?

[...]
>  /*
>   * Run the vsie on a shadow scb and a shadow gmap, without any further
>   * sanity checks, handling SIE faults.
> @@ -1063,6 +1132,10 @@ static int do_vsie_run(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page)
>  		if ((scb_s->ipa & 0xf000) != 0xf000)
>  			scb_s->ipa += 0x1000;
>  		break;
> +	case ICPT_PARTEXEC:
> +		if (scb_s->ipa == 0xb254)

Old code hat "/* MVPG only */" - why is this condition now necessary?

> +			rc = vsie_handle_mvpg(vcpu, vsie_page);
> +		break;
>  	}
>  	return rc;
>  }
>
Claudio Imbrenda Jan. 4, 2021, 3:22 p.m. UTC | #2
On Sun, 20 Dec 2020 11:13:57 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 18.12.20 15:18, Claudio Imbrenda wrote:
> > Correctly handle the MVPG instruction when issued by a VSIE guest.
> >   
> 
> I remember that MVPG SIE documentation was completely crazy and full
> of corner cases. :)

you remember correctly

> Looking at arch/s390/kvm/intercept.c:handle_mvpg_pei(), I can spot
> that
> 
> 1. "This interception can only happen for guests with DAT disabled
> ..." 2. KVM does not make use of any mvpg state inside the SCB.
> 
> Can this be observed with Linux guests?

a Linux guest will typically not run with DAT disabled

> Can I get some information on what information is stored at [0xc0,
> 0xd) inside the SCB? I assume it's:
> 
> 0xc0: guest physical address of source PTE
> 0xc8: guest physical address of target PTE

yes (plus 3 flags in the lower bits of each)

> 
> Also, which conditions have to be met such that we get a
> ICPT_PARTEXEC:
> 
> a) State of guest DAT (I assume off?)
> b) State of PTEs: What happens if there is no PTE (I assume we need
> two PTEs, otherwise no such intercept)? I assume we get an intercept
> if one of both PTEs is not present or the destination PTE is
> protected. Correct?

we get the intercept if the guest has DAT off, and at least one of the
host PTEs is invalid (and I think if the destination is valid but
protected)

> So, when we (g1) get an intercept for g3, can we be sure 0xc0 and 0xc8
> in the scb are both valid g1 addresses pointing at our PTE, and what
> do we know about these PTEs (one not present or destination
> protected)?

we only know that at least one of the following holds true:
* source invalid
* destination invalid
* destination protected

there is a bit that tells us if the destination was protected (bit 62),
but that does not exclude an invalid source

> [...]
> >  /*
> >   * Run the vsie on a shadow scb and a shadow gmap, without any
> > further
> >   * sanity checks, handling SIE faults.
> > @@ -1063,6 +1132,10 @@ static int do_vsie_run(struct kvm_vcpu
> > *vcpu, struct vsie_page *vsie_page) if ((scb_s->ipa & 0xf000) !=
> > 0xf000) scb_s->ipa += 0x1000;
> >  		break;
> > +	case ICPT_PARTEXEC:
> > +		if (scb_s->ipa == 0xb254)  
> 
> Old code hat "/* MVPG only */" - why is this condition now necessary?

old code was wrong ;)

> > +			rc = vsie_handle_mvpg(vcpu, vsie_page);
> > +		break;
> >  	}
> >  	return rc;
> >  }
> >   
> 
>
David Hildenbrand Jan. 4, 2021, 4:08 p.m. UTC | #3
On 04.01.21 16:22, Claudio Imbrenda wrote:
> On Sun, 20 Dec 2020 11:13:57 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 18.12.20 15:18, Claudio Imbrenda wrote:
>>> Correctly handle the MVPG instruction when issued by a VSIE guest.
>>>   
>>
>> I remember that MVPG SIE documentation was completely crazy and full
>> of corner cases. :)
> 
> you remember correctly
> 
>> Looking at arch/s390/kvm/intercept.c:handle_mvpg_pei(), I can spot
>> that
>>
>> 1. "This interception can only happen for guests with DAT disabled
>> ..." 2. KVM does not make use of any mvpg state inside the SCB.
>>
>> Can this be observed with Linux guests?
> 
> a Linux guest will typically not run with DAT disabled
> 
>> Can I get some information on what information is stored at [0xc0,
>> 0xd) inside the SCB? I assume it's:
>>
>> 0xc0: guest physical address of source PTE
>> 0xc8: guest physical address of target PTE
> 
> yes (plus 3 flags in the lower bits of each)

Thanks! Do the flags tell us what the deal with the PTE was? If yes,
what's the meaning of the separate flags?

I assume something like "invalid, proteced, ??"

I'm asking because I think we can handle this a little easier.

> 
>> [...]
>>>  /*
>>>   * Run the vsie on a shadow scb and a shadow gmap, without any
>>> further
>>>   * sanity checks, handling SIE faults.
>>> @@ -1063,6 +1132,10 @@ static int do_vsie_run(struct kvm_vcpu
>>> *vcpu, struct vsie_page *vsie_page) if ((scb_s->ipa & 0xf000) !=
>>> 0xf000) scb_s->ipa += 0x1000;
>>>  		break;
>>> +	case ICPT_PARTEXEC:
>>> +		if (scb_s->ipa == 0xb254)  
>>
>> Old code hat "/* MVPG only */" - why is this condition now necessary?
> 
> old code was wrong ;)


arch/s390/kvm/intercept.c:handle_partial_execution() we only seem to handle

1. MVPG
2. SIGP PEI

The latter is only relevant for external calls. IIRC, this is only active
with sigp interpretation - which is never active under vsie (ECA_SIGPI).
Claudio Imbrenda Jan. 4, 2021, 4:36 p.m. UTC | #4
On Mon, 4 Jan 2021 17:08:15 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 04.01.21 16:22, Claudio Imbrenda wrote:
> > On Sun, 20 Dec 2020 11:13:57 +0100
> > David Hildenbrand <david@redhat.com> wrote:
> >   
> >> On 18.12.20 15:18, Claudio Imbrenda wrote:  
> >>> Correctly handle the MVPG instruction when issued by a VSIE guest.
> >>>     
> >>
> >> I remember that MVPG SIE documentation was completely crazy and
> >> full of corner cases. :)  
> > 
> > you remember correctly
> >   
> >> Looking at arch/s390/kvm/intercept.c:handle_mvpg_pei(), I can spot
> >> that
> >>
> >> 1. "This interception can only happen for guests with DAT disabled
> >> ..." 2. KVM does not make use of any mvpg state inside the SCB.
> >>
> >> Can this be observed with Linux guests?  
> > 
> > a Linux guest will typically not run with DAT disabled
> >   
> >> Can I get some information on what information is stored at [0xc0,
> >> 0xd) inside the SCB? I assume it's:
> >>
> >> 0xc0: guest physical address of source PTE
> >> 0xc8: guest physical address of target PTE  
> > 
> > yes (plus 3 flags in the lower bits of each)  
> 
> Thanks! Do the flags tell us what the deal with the PTE was? If yes,
> what's the meaning of the separate flags?
> 
> I assume something like "invalid, proteced, ??"

bit 61 indicates that the address is a region or segment table entry,
when EDAT applies
bit 62 is "protected" when the protected bit is set in the segment
table entry (or region, if EDAT applies) 
bit 63 is set when the operand was translated with a real-space ASCE

but you can check if the PTE is valid just by dereferencing the
pointers...

> I'm asking because I think we can handle this a little easier.

what is your idea?

> >   
> >> [...]  
> >>>  /*
> >>>   * Run the vsie on a shadow scb and a shadow gmap, without any
> >>> further
> >>>   * sanity checks, handling SIE faults.
> >>> @@ -1063,6 +1132,10 @@ static int do_vsie_run(struct kvm_vcpu
> >>> *vcpu, struct vsie_page *vsie_page) if ((scb_s->ipa & 0xf000) !=
> >>> 0xf000) scb_s->ipa += 0x1000;
> >>>  		break;
> >>> +	case ICPT_PARTEXEC:
> >>> +		if (scb_s->ipa == 0xb254)    
> >>
> >> Old code hat "/* MVPG only */" - why is this condition now
> >> necessary?  
> > 
> > old code was wrong ;)  
> 
> 
> arch/s390/kvm/intercept.c:handle_partial_execution() we only seem to
> handle
> 
> 1. MVPG
> 2. SIGP PEI
> 
> The latter is only relevant for external calls. IIRC, this is only
> active with sigp interpretation - which is never active under vsie
> (ECA_SIGPI).

I think putting an explicit check is better than just a jump in the
dark.
David Hildenbrand Jan. 5, 2021, 10:17 a.m. UTC | #5
On 04.01.21 17:36, Claudio Imbrenda wrote:
> On Mon, 4 Jan 2021 17:08:15 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 04.01.21 16:22, Claudio Imbrenda wrote:
>>> On Sun, 20 Dec 2020 11:13:57 +0100
>>> David Hildenbrand <david@redhat.com> wrote:
>>>   
>>>> On 18.12.20 15:18, Claudio Imbrenda wrote:  
>>>>> Correctly handle the MVPG instruction when issued by a VSIE guest.
>>>>>     
>>>>
>>>> I remember that MVPG SIE documentation was completely crazy and
>>>> full of corner cases. :)  
>>>
>>> you remember correctly
>>>   
>>>> Looking at arch/s390/kvm/intercept.c:handle_mvpg_pei(), I can spot
>>>> that
>>>>
>>>> 1. "This interception can only happen for guests with DAT disabled
>>>> ..." 2. KVM does not make use of any mvpg state inside the SCB.
>>>>
>>>> Can this be observed with Linux guests?  
>>>
>>> a Linux guest will typically not run with DAT disabled
>>>   
>>>> Can I get some information on what information is stored at [0xc0,
>>>> 0xd) inside the SCB? I assume it's:
>>>>
>>>> 0xc0: guest physical address of source PTE
>>>> 0xc8: guest physical address of target PTE  
>>>
>>> yes (plus 3 flags in the lower bits of each)  
>>
>> Thanks! Do the flags tell us what the deal with the PTE was? If yes,
>> what's the meaning of the separate flags?
>>
>> I assume something like "invalid, proteced, ??"
> 
> bit 61 indicates that the address is a region or segment table entry,
> when EDAT applies
> bit 62 is "protected" when the protected bit is set in the segment
> table entry (or region, if EDAT applies) 
> bit 63 is set when the operand was translated with a real-space ASCE

Thanks!

> but you can check if the PTE is valid just by dereferencing the
> pointers...

The pgtable might already have been unshadowed and repurposed I think.
So for vSIE, the PTE content, therefore, is a little unreliable.

We could, of course, try using them to make a guess.

"Likely valid"
"Likely invalid"

A rerun of the vSIE will fixup any wrong guess.

> 
>> I'm asking because I think we can handle this a little easier.
> 
> what is your idea?

I was wondering if we can

1. avoid essentially two translations per PTE, obtaining the information
we need while tying to shadow. kvm_s390_shadow_fault() on steroids that

a) gives us the last guest pte address (tricky for segment.region table
I think ... will have to think about this)
b) the final protection

2. avoid faulting/shadowing in case we know an entry is not problematic.
E.g., no need to shadow/fault the source in case the PTE is there and
not invalid. "likely valid" case above.


The idea would be to call the new kvm_s390_shadow_fault() two times (or
only once due to our guesses) and either rerun the vsie, inject an
interrupt, or create the partial intercept.

Essentially avoiding kvm_s390_vsie_mvpg_check(). Will have to think
about this.

[...]
>>
>> arch/s390/kvm/intercept.c:handle_partial_execution() we only seem to
>> handle
>>
>> 1. MVPG
>> 2. SIGP PEI
>>
>> The latter is only relevant for external calls. IIRC, this is only
>> active with sigp interpretation - which is never active under vsie
>> (ECA_SIGPI).
> 
> I think putting an explicit check is better than just a jump in the
> dark.

Agreed, but that should then be called out somewhere why the change as
done. (e.g., separate cleanup patch)
diff mbox series

Patch

diff --git a/arch/s390/kvm/vsie.c b/arch/s390/kvm/vsie.c
index ada49583e530..6c3069868acd 100644
--- a/arch/s390/kvm/vsie.c
+++ b/arch/s390/kvm/vsie.c
@@ -977,6 +977,75 @@  static int handle_stfle(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page)
 	return 0;
 }
 
+static u64 vsie_get_register(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page, u8 reg)
+{
+	reg &= 0xf;
+	switch (reg) {
+	case 15:
+		return vsie_page->scb_s.gg15;
+	case 14:
+		return vsie_page->scb_s.gg14;
+	default:
+		return vcpu->run->s.regs.gprs[reg];
+	}
+}
+
+static int vsie_handle_mvpg(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page)
+{
+	struct kvm_s390_sie_block *scb_s = &vsie_page->scb_s;
+	unsigned long r1, r2, mask = PAGE_MASK;
+	int rc;
+
+	if (psw_bits(scb_s->gpsw).eaba == PSW_BITS_AMODE_24BIT)
+		mask = 0xfff000;
+	else if (psw_bits(scb_s->gpsw).eaba == PSW_BITS_AMODE_31BIT)
+		mask = 0x7ffff000;
+
+	r1 = vsie_get_register(vcpu, vsie_page, scb_s->ipb >> 20) & mask;
+	r2 = vsie_get_register(vcpu, vsie_page, scb_s->ipb >> 16) & mask;
+	rc = kvm_s390_vsie_mvpg_check(vcpu, r1, r2, &vsie_page->scb_o->mcic);
+
+	/*
+	 * Guest translation was not successful. The host needs to forward
+	 * the intercept to the guest and let the guest fix its page tables.
+	 * The guest needs then to retry the instruction.
+	 */
+	if (rc == -ENOENT)
+		return 1;
+
+	retry_vsie_icpt(vsie_page);
+
+	/*
+	 * Guest translation was not successful. The page tables of the guest
+	 * are broken. Try again and let the hardware deliver the fault.
+	 */
+	if (rc == -EFAULT)
+		return 0;
+
+	/*
+	 * Guest translation was successful. The host needs to fix up its
+	 * page tables and retry the instruction in the nested guest.
+	 * In case of failure, the instruction will intercept again, and
+	 * a different path will be taken.
+	 */
+	if (!rc) {
+		kvm_s390_shadow_fault(vcpu, vsie_page->gmap, r2);
+		kvm_s390_shadow_fault(vcpu, vsie_page->gmap, r1);
+		return 0;
+	}
+
+	/*
+	 * An exception happened during guest translation, it needs to be
+	 * delivered to the guest. This can happen if the host has EDAT1
+	 * enabled and the guest has not, or for other causes. The guest
+	 * needs to process the exception and return to the nested guest.
+	 */
+	if (rc > 0)
+		return kvm_s390_inject_prog_cond(vcpu, rc);
+
+	return 1;
+}
+
 /*
  * Run the vsie on a shadow scb and a shadow gmap, without any further
  * sanity checks, handling SIE faults.
@@ -1063,6 +1132,10 @@  static int do_vsie_run(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page)
 		if ((scb_s->ipa & 0xf000) != 0xf000)
 			scb_s->ipa += 0x1000;
 		break;
+	case ICPT_PARTEXEC:
+		if (scb_s->ipa == 0xb254)
+			rc = vsie_handle_mvpg(vcpu, vsie_page);
+		break;
 	}
 	return rc;
 }