[v5,11/11] s390/mm: Enable gmap huge pmd support
diff mbox

Message ID 20180706135529.88966-12-frankja@linux.ibm.com
State New
Headers show

Commit Message

Janosch Frank July 6, 2018, 1:55 p.m. UTC
From: Janosch Frank <frankja@linux.vnet.ibm.com>

Now that we have everything in place, let's allow huge (1m) pmds for
gmap linking, effectively allowing hugetlbfs backed guests. Transparent
huge pages and 2g huge pages are *not* supported through this change.

General KVM huge page support on s390 has to be enabled via the
kvm.hpage module parameter. Either nested or hpage can be enabled, as
we currently do not support vSIE for huge backed guests. Once the vSIE
support is added we will either drop the parameter or enable it as
default.

For a guest the feature has to be enabled through the new
KVM_CAP_S390_HPAGE_1M capability and the hpage module
parameter. Enabling it means that cmm can't be enabled for the vm and
disables pfmf and storage key interpretation.

This is due to the fact that in some cases, in upcoming patches, we
have to split huge pages in the guest mapping to be able to set more
granular memory protection on 4k pages. These split pages have fake
page tables that are not visible to the Linux memory management which
subsequently will not manage its PGSTEs, while the SIE will. Disabling
these features lets us manage PGSTE data in a consistent matter and
solve that problem.

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 Documentation/virtual/kvm/api.txt   | 16 ++++++++++++++
 arch/s390/include/asm/mmu.h         |  2 ++
 arch/s390/include/asm/mmu_context.h |  1 +
 arch/s390/kvm/kvm-s390.c            | 42 +++++++++++++++++++++++++++++++++++--
 arch/s390/mm/gmap.c                 |  8 ++++---
 include/uapi/linux/kvm.h            |  1 +
 6 files changed, 65 insertions(+), 5 deletions(-)

Comments

David Hildenbrand July 12, 2018, 7:23 a.m. UTC | #1
I'd still vote for turning this into a "KVM: s390x: " patch or splitting
the gmap parts off.

>  8. Other capabilities.
>  ----------------------
>  
> diff --git a/arch/s390/include/asm/mmu.h b/arch/s390/include/asm/mmu.h
> index f5ff9dbad8ac..652cd658e242 100644
> --- a/arch/s390/include/asm/mmu.h
> +++ b/arch/s390/include/asm/mmu.h
> @@ -24,6 +24,8 @@ typedef struct {
>  	unsigned int uses_skeys:1;
>  	/* The mmu context uses CMM. */
>  	unsigned int uses_cmm:1;
> +	/* The gmap associated with this context uses huge pages. */
> +	unsigned int uses_gmap_hpage:1;

Didn't we discuss that something like "allow_gmap_1mb_hpage" or similar
would be more appropriate? The "uses" part can easily be wrong.

>  #include <linux/kernel.h>
> @@ -589,8 +591,8 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
>  		return -EFAULT;
>  	pmd = pmd_offset(pud, vmaddr);
>  	VM_BUG_ON(pmd_none(*pmd));
> -	/* large pmds cannot yet be handled */
> -	if (pmd_large(*pmd))
> +	/* Are we allowed to use huge pages? */
> +	if (pmd_large(*pmd) && !gmap->mm->context.uses_gmap_hpage)
>  		return -EFAULT;
>  	/* Link gmap segment table entry location to page table. */
>  	rc = radix_tree_preload(GFP_KERNEL);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index b6270a3b38e9..b955b986b341 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -949,6 +949,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_GET_MSR_FEATURES 153
>  #define KVM_CAP_HYPERV_EVENTFD 154
>  #define KVM_CAP_HYPERV_TLBFLUSH 155
> +#define KVM_CAP_S390_HPAGE_1M 156
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> 

I was wondering if we should add some safety net for gmap shadows when
hitting a huge pmd. Or when trying to create a shadow for a gmap with
huge pmds enabled (add a check in gmap_shadow()). So the GMAP kernel
interface remains consistent.


Apart from that, looks good to me!
Janosch Frank July 12, 2018, 8:09 a.m. UTC | #2
On Thu, 12 Jul 2018 09:23:01 +0200
David Hildenbrand <david@redhat.com> wrote:
> 
> >  #include <linux/kernel.h>
> > @@ -589,8 +591,8 @@ int __gmap_link(struct gmap *gmap, unsigned
> > long gaddr, unsigned long vmaddr) return -EFAULT;
> >  	pmd = pmd_offset(pud, vmaddr);
> >  	VM_BUG_ON(pmd_none(*pmd));
> > -	/* large pmds cannot yet be handled */
> > -	if (pmd_large(*pmd))
> > +	/* Are we allowed to use huge pages? */
> > +	if (pmd_large(*pmd) && !gmap->mm->context.uses_gmap_hpage)
> >  		return -EFAULT;
> >  	/* Link gmap segment table entry location to page table. */
> >  	rc = radix_tree_preload(GFP_KERNEL);
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index b6270a3b38e9..b955b986b341 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -949,6 +949,7 @@ struct kvm_ppc_resize_hpt {
> >  #define KVM_CAP_GET_MSR_FEATURES 153
> >  #define KVM_CAP_HYPERV_EVENTFD 154
> >  #define KVM_CAP_HYPERV_TLBFLUSH 155
> > +#define KVM_CAP_S390_HPAGE_1M 156
> >  
> >  #ifdef KVM_CAP_IRQ_ROUTING
> >  
> >   
> 
> I was wondering if we should add some safety net for gmap shadows when
> hitting a huge pmd. Or when trying to create a shadow for a gmap with
> huge pmds enabled (add a check in gmap_shadow()). So the GMAP kernel
> interface remains consistent.

For now we don't have the sief2 so we don't end up in gmap_shadow()
anyway. Huge l3 in small l2 is currently supported via the fake tables.

So what do you want to keep consistent?


> 
> 
> Apart from that, looks good to me!
>
David Hildenbrand July 12, 2018, 8:35 a.m. UTC | #3
On 12.07.2018 10:09, frankja wrote:
> On Thu, 12 Jul 2018 09:23:01 +0200
> David Hildenbrand <david@redhat.com> wrote:
>>
>>>  #include <linux/kernel.h>
>>> @@ -589,8 +591,8 @@ int __gmap_link(struct gmap *gmap, unsigned
>>> long gaddr, unsigned long vmaddr) return -EFAULT;
>>>  	pmd = pmd_offset(pud, vmaddr);
>>>  	VM_BUG_ON(pmd_none(*pmd));
>>> -	/* large pmds cannot yet be handled */
>>> -	if (pmd_large(*pmd))
>>> +	/* Are we allowed to use huge pages? */
>>> +	if (pmd_large(*pmd) && !gmap->mm->context.uses_gmap_hpage)
>>>  		return -EFAULT;
>>>  	/* Link gmap segment table entry location to page table. */
>>>  	rc = radix_tree_preload(GFP_KERNEL);
>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>> index b6270a3b38e9..b955b986b341 100644
>>> --- a/include/uapi/linux/kvm.h
>>> +++ b/include/uapi/linux/kvm.h
>>> @@ -949,6 +949,7 @@ struct kvm_ppc_resize_hpt {
>>>  #define KVM_CAP_GET_MSR_FEATURES 153
>>>  #define KVM_CAP_HYPERV_EVENTFD 154
>>>  #define KVM_CAP_HYPERV_TLBFLUSH 155
>>> +#define KVM_CAP_S390_HPAGE_1M 156
>>>  
>>>  #ifdef KVM_CAP_IRQ_ROUTING
>>>  
>>>   
>>
>> I was wondering if we should add some safety net for gmap shadows when
>> hitting a huge pmd. Or when trying to create a shadow for a gmap with
>> huge pmds enabled (add a check in gmap_shadow()). So the GMAP kernel
>> interface remains consistent.
> 
> For now we don't have the sief2 so we don't end up in gmap_shadow()
> anyway. Huge l3 in small l2 is currently supported via the fake tables.
> 
> So what do you want to keep consistent?

gmap is just an internal kernel interface and KVM is the only user right
now. The logic that gmap shadow does not work with hpage gmaps is right
now buried in KVM, not in the GMAP. That's why I am asking if it makes
sense to also have safety nets in the gmap shadow code.

If you added the other safety net I suggested (PMD protection with
SHADOW notifier bit set -> -EINVAL), we are maybe already on the safe
side that if gmap_shadow() would be used. We would maybe fail in a nice
fashion. But I didn't check all call paths. Prohibiting gmap_shadow()
when huge pages are allowed in the GMAP right from the start would be
the big hammer.
Janosch Frank July 12, 2018, 9:04 a.m. UTC | #4
On Thu, 12 Jul 2018 10:35:00 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 12.07.2018 10:09, frankja wrote:
> > On Thu, 12 Jul 2018 09:23:01 +0200
> > David Hildenbrand <david@redhat.com> wrote:  
> >>  
> >>>  #include <linux/kernel.h>
> >>> @@ -589,8 +591,8 @@ int __gmap_link(struct gmap *gmap, unsigned
> >>> long gaddr, unsigned long vmaddr) return -EFAULT;
> >>>  	pmd = pmd_offset(pud, vmaddr);
> >>>  	VM_BUG_ON(pmd_none(*pmd));
> >>> -	/* large pmds cannot yet be handled */
> >>> -	if (pmd_large(*pmd))
> >>> +	/* Are we allowed to use huge pages? */
> >>> +	if (pmd_large(*pmd)
> >>> && !gmap->mm->context.uses_gmap_hpage) return -EFAULT;
> >>>  	/* Link gmap segment table entry location to page table.
> >>> */ rc = radix_tree_preload(GFP_KERNEL);
> >>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>> index b6270a3b38e9..b955b986b341 100644
> >>> --- a/include/uapi/linux/kvm.h
> >>> +++ b/include/uapi/linux/kvm.h
> >>> @@ -949,6 +949,7 @@ struct kvm_ppc_resize_hpt {
> >>>  #define KVM_CAP_GET_MSR_FEATURES 153
> >>>  #define KVM_CAP_HYPERV_EVENTFD 154
> >>>  #define KVM_CAP_HYPERV_TLBFLUSH 155
> >>> +#define KVM_CAP_S390_HPAGE_1M 156
> >>>  
> >>>  #ifdef KVM_CAP_IRQ_ROUTING
> >>>  
> >>>     
> >>
> >> I was wondering if we should add some safety net for gmap shadows
> >> when hitting a huge pmd. Or when trying to create a shadow for a
> >> gmap with huge pmds enabled (add a check in gmap_shadow()). So the
> >> GMAP kernel interface remains consistent.  
> > 
> > For now we don't have the sief2 so we don't end up in gmap_shadow()
> > anyway. Huge l3 in small l2 is currently supported via the fake
> > tables.
> > 
> > So what do you want to keep consistent?  
> 
> gmap is just an internal kernel interface and KVM is the only user
> right now. The logic that gmap shadow does not work with hpage gmaps
> is right now buried in KVM, not in the GMAP. That's why I am asking
> if it makes sense to also have safety nets in the gmap shadow code.
> 
> If you added the other safety net I suggested (PMD protection with
> SHADOW notifier bit set -> -EINVAL), we are maybe already on the safe
> side that if gmap_shadow() would be used. We would maybe fail in a
> nice fashion. But I didn't check all call paths. Prohibiting
> gmap_shadow() when huge pages are allowed in the GMAP right from the
> start would be the big hammer.
> 

I'll add a BUG_ON(parent->mm->context.allow_gmap_hpage_1m); it's
removed later anyway...

For now I'll also add the notifier warning for the gmap invalidation, it
should only slow us down with thp, as hpages are otherwise never
touched after fault-in. And it will be useful for thp testing in a few
months.
David Hildenbrand July 12, 2018, 9:18 a.m. UTC | #5
On 12.07.2018 11:04, frankja wrote:
> On Thu, 12 Jul 2018 10:35:00 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 12.07.2018 10:09, frankja wrote:
>>> On Thu, 12 Jul 2018 09:23:01 +0200
>>> David Hildenbrand <david@redhat.com> wrote:  
>>>>  
>>>>>  #include <linux/kernel.h>
>>>>> @@ -589,8 +591,8 @@ int __gmap_link(struct gmap *gmap, unsigned
>>>>> long gaddr, unsigned long vmaddr) return -EFAULT;
>>>>>  	pmd = pmd_offset(pud, vmaddr);
>>>>>  	VM_BUG_ON(pmd_none(*pmd));
>>>>> -	/* large pmds cannot yet be handled */
>>>>> -	if (pmd_large(*pmd))
>>>>> +	/* Are we allowed to use huge pages? */
>>>>> +	if (pmd_large(*pmd)
>>>>> && !gmap->mm->context.uses_gmap_hpage) return -EFAULT;
>>>>>  	/* Link gmap segment table entry location to page table.
>>>>> */ rc = radix_tree_preload(GFP_KERNEL);
>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>>> index b6270a3b38e9..b955b986b341 100644
>>>>> --- a/include/uapi/linux/kvm.h
>>>>> +++ b/include/uapi/linux/kvm.h
>>>>> @@ -949,6 +949,7 @@ struct kvm_ppc_resize_hpt {
>>>>>  #define KVM_CAP_GET_MSR_FEATURES 153
>>>>>  #define KVM_CAP_HYPERV_EVENTFD 154
>>>>>  #define KVM_CAP_HYPERV_TLBFLUSH 155
>>>>> +#define KVM_CAP_S390_HPAGE_1M 156
>>>>>  
>>>>>  #ifdef KVM_CAP_IRQ_ROUTING
>>>>>  
>>>>>     
>>>>
>>>> I was wondering if we should add some safety net for gmap shadows
>>>> when hitting a huge pmd. Or when trying to create a shadow for a
>>>> gmap with huge pmds enabled (add a check in gmap_shadow()). So the
>>>> GMAP kernel interface remains consistent.  
>>>
>>> For now we don't have the sief2 so we don't end up in gmap_shadow()
>>> anyway. Huge l3 in small l2 is currently supported via the fake
>>> tables.
>>>
>>> So what do you want to keep consistent?  
>>
>> gmap is just an internal kernel interface and KVM is the only user
>> right now. The logic that gmap shadow does not work with hpage gmaps
>> is right now buried in KVM, not in the GMAP. That's why I am asking
>> if it makes sense to also have safety nets in the gmap shadow code.
>>
>> If you added the other safety net I suggested (PMD protection with
>> SHADOW notifier bit set -> -EINVAL), we are maybe already on the safe
>> side that if gmap_shadow() would be used. We would maybe fail in a
>> nice fashion. But I didn't check all call paths. Prohibiting
>> gmap_shadow() when huge pages are allowed in the GMAP right from the
>> start would be the big hammer.
>>
> 
> I'll add a BUG_ON(parent->mm->context.allow_gmap_hpage_1m); it's
> removed later anyway...
> 
> For now I'll also add the notifier warning for the gmap invalidation, it
> should only slow us down with thp, as hpages are otherwise never
> touched after fault-in. And it will be useful for thp testing in a few
> months.
> 

Sounds good to me! Looking forward to the next version of this series :)

Patch
diff mbox

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index d10944e619d3..cb8db4f9d097 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -4391,6 +4391,22 @@  all such vmexits.
 
 Do not enable KVM_FEATURE_PV_UNHALT if you disable HLT exits.
 
+7.14 KVM_CAP_S390_HPAGE_1M
+
+Architectures: s390
+Parameters: none
+Returns: 0 on success, -EINVAL if hpage module parameter was not set
+	 or cmma is enabled
+
+With this capability the KVM support for memory backing with 1m pages
+through hugetlbfs can be enabled for a VM. After the capability is
+enabled, cmma can't be enabled anymore and pfmfi and the storage key
+interpretation are disabled. If cmma has already been enabled or the
+hpage module parameter is not set to 1, -EINVAL is returned.
+
+While it is generally possible to create a huge page backed VM without
+this capability, the VM will not be able to run.
+
 8. Other capabilities.
 ----------------------
 
diff --git a/arch/s390/include/asm/mmu.h b/arch/s390/include/asm/mmu.h
index f5ff9dbad8ac..652cd658e242 100644
--- a/arch/s390/include/asm/mmu.h
+++ b/arch/s390/include/asm/mmu.h
@@ -24,6 +24,8 @@  typedef struct {
 	unsigned int uses_skeys:1;
 	/* The mmu context uses CMM. */
 	unsigned int uses_cmm:1;
+	/* The gmap associated with this context uses huge pages. */
+	unsigned int uses_gmap_hpage:1;
 } mm_context_t;
 
 #define INIT_MM_CONTEXT(name)						   \
diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
index d16bc79c30bb..407165d805b9 100644
--- a/arch/s390/include/asm/mmu_context.h
+++ b/arch/s390/include/asm/mmu_context.h
@@ -32,6 +32,7 @@  static inline int init_new_context(struct task_struct *tsk,
 	mm->context.has_pgste = 0;
 	mm->context.uses_skeys = 0;
 	mm->context.uses_cmm = 0;
+	mm->context.uses_gmap_hpage = 0;
 #endif
 	switch (mm->context.asce_limit) {
 	case _REGION2_SIZE:
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 6acc46cc7f7f..60014ea9d3f5 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -171,7 +171,10 @@  struct kvm_s390_tod_clock_ext {
 static int nested;
 module_param(nested, int, S_IRUGO);
 MODULE_PARM_DESC(nested, "Nested virtualization support");
-
+/* allow 1m huge page guest backing, if !nested */
+static int hpage;
+module_param(hpage, int, 0444);
+MODULE_PARM_DESC(hpage, "1m huge page backing support");
 
 /*
  * For now we handle at most 16 double words as this is what the s390 base
@@ -475,6 +478,11 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_S390_AIS_MIGRATION:
 		r = 1;
 		break;
+	case KVM_CAP_S390_HPAGE_1M:
+		r = 0;
+		if (hpage)
+			r = 1;
+		break;
 	case KVM_CAP_S390_MEM_OP:
 		r = MEM_OP_MAX_SIZE;
 		break;
@@ -672,6 +680,27 @@  static int kvm_vm_ioctl_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
 		VM_EVENT(kvm, 3, "ENABLE: CAP_S390_GS %s",
 			 r ? "(not available)" : "(success)");
 		break;
+	case KVM_CAP_S390_HPAGE_1M:
+		mutex_lock(&kvm->lock);
+		if (kvm->created_vcpus)
+			r = -EBUSY;
+		else if (!hpage || kvm->arch.use_cmma)
+			r = -EINVAL;
+		else {
+			r = 0;
+			kvm->mm->context.uses_gmap_hpage = 1;
+			/*
+			 * We might have to create fake 4k page
+			 * tables. To avoid that the hardware works on
+			 * stale PGSTEs, we emulate these instructions.
+			 */
+			kvm->arch.use_skf = 0;
+			kvm->arch.use_pfmfi = 0;
+		}
+		mutex_unlock(&kvm->lock);
+		VM_EVENT(kvm, 3, "ENABLE: CAP_S390_HPAGE %s",
+			 r ? "(not available)" : "(success)");
+		break;
 	case KVM_CAP_S390_USER_STSI:
 		VM_EVENT(kvm, 3, "%s", "ENABLE: CAP_S390_USER_STSI");
 		kvm->arch.user_stsi = 1;
@@ -722,7 +751,11 @@  static int kvm_s390_set_mem_control(struct kvm *kvm, struct kvm_device_attr *att
 		ret = -EBUSY;
 		VM_EVENT(kvm, 3, "%s", "ENABLE: CMMA support");
 		mutex_lock(&kvm->lock);
-		if (!kvm->created_vcpus) {
+		if (kvm->created_vcpus)
+			ret = -EBUSY;
+		else if (kvm->mm->context.uses_gmap_hpage)
+			ret = -EINVAL;
+		else {
 			kvm->arch.use_cmma = 1;
 			/* Not compatible with cmma. */
 			kvm->arch.use_pfmfi = 0;
@@ -4087,6 +4120,11 @@  static int __init kvm_s390_init(void)
 		return -ENODEV;
 	}
 
+	if (nested && hpage) {
+		pr_info("nested (vSIE) and hpage (huge page backing) can currently not be activated concurrently");
+		return -EINVAL;
+	}
+
 	for (i = 0; i < 16; i++)
 		kvm_s390_fac_base[i] |=
 			S390_lowcore.stfle_fac_list[i] & nonhyp_mask(i);
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 8992f809e3a6..03ae469fc0b6 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2,8 +2,10 @@ 
 /*
  *  KVM guest address space mapping code
  *
- *    Copyright IBM Corp. 2007, 2016
+ *    Copyright IBM Corp. 2007, 2016, 2017
  *    Author(s): Martin Schwidefsky <schwidefsky@de.ibm.com>
+ *		 David Hildenbrand <david@redhat.com>
+ *		 Janosch Frank <frankja@linux.vnet.ibm.com>
  */
 
 #include <linux/kernel.h>
@@ -589,8 +591,8 @@  int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
 		return -EFAULT;
 	pmd = pmd_offset(pud, vmaddr);
 	VM_BUG_ON(pmd_none(*pmd));
-	/* large pmds cannot yet be handled */
-	if (pmd_large(*pmd))
+	/* Are we allowed to use huge pages? */
+	if (pmd_large(*pmd) && !gmap->mm->context.uses_gmap_hpage)
 		return -EFAULT;
 	/* Link gmap segment table entry location to page table. */
 	rc = radix_tree_preload(GFP_KERNEL);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index b6270a3b38e9..b955b986b341 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -949,6 +949,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_GET_MSR_FEATURES 153
 #define KVM_CAP_HYPERV_EVENTFD 154
 #define KVM_CAP_HYPERV_TLBFLUSH 155
+#define KVM_CAP_S390_HPAGE_1M 156
 
 #ifdef KVM_CAP_IRQ_ROUTING