[13/15] arm64: kvm: Rewrite fake pgd handling

Message ID	1442331684-28818-14-git-send-email-suzuki.poulose@arm.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@kernel.org> From: "Suzuki K. Poulose" <suzuki.poulose@arm.com> To: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org, Catalin.Marinas@arm.com, Will.Deacon@arm.com, Mark.Rutland@arm.com, Marc.Zyngier@arm.com, kvmarm@lists.cs.columbia.edu, kvm@vger.kernel.org, ard.biesheuvel@linaro.org, suzuki.poulose@arm.com, christoffer.dall@linaro.org Subject: [PATCH 13/15] arm64: kvm: Rewrite fake pgd handling Date: Tue, 15 Sep 2015 16:41:22 +0100 Message-Id: <1442331684-28818-14-git-send-email-suzuki.poulose@arm.com> In-Reply-To: <1442331684-28818-1-git-send-email-suzuki.poulose@arm.com> References: <1442331684-28818-1-git-send-email-suzuki.poulose@arm.com> Content-Type: text/plain; charset=WINDOWS-1252 Content-Transfer-Encoding: quoted-printable Sender: kvm-owner@vger.kernel.org Precedence: bulk

Suzuki K Poulose Sept. 15, 2015, 3:41 p.m. UTC

From: "Suzuki K. Poulose" <suzuki.poulose@arm.com>

The existing fake pgd handling code assumes that the stage-2 entry
level can only be one level down that of the host, which may not be
true always(e.g, with the introduction of 16k pagesize).

e.g.
With 16k page size and 48bit VA and 40bit IPA we have the following
split for page table levels:

level:  0       1         2         3
bits : [47] [46 - 36] [35 - 25] [24 - 14] [13 - 0]
         ^       ^     ^
         |       |     |
   host entry    |     x---- stage-2 entry
                 |
        IPA -----x

The stage-2 entry level is 2, due to the concatenation of 16tables
at level 2(mandated by the hardware). So, we need to fake two levels
to actually reach the hyp page table. This case cannot be handled
with the existing code, as, all we know about is KVM_PREALLOC_LEVEL
which kind of stands for two different pieces of information.

1) Whether we have fake page table entry levels.
2) The entry level of stage-2 translation.

We loose the information about the number of fake levels that
we may have to use. Also, KVM_PREALLOC_LEVEL computation itself
is wrong, as we assume the hw entry level is always 1 level down
from the host.

This patch introduces two seperate indicators :
1) Accurate entry level for stage-2 translation - HYP_PGTABLE_ENTRY_LEVEL -
   using the new helpers.
2) Number of levels of fake pagetable entries. (KVM_FAKE_PGTABLE_LEVELS)

The following conditions hold true for all cases(with 40bit IPA)
1) The stage-2 entry level <= 2
2) Number of fake page-table entries is in the inclusive range [0, 2].

Cc: kvmarm@lists.cs.columbia.edu
Cc: christoffer.dall@linaro.org
Cc: Marc.Zyngier@arm.com
Signed-off-by: Suzuki K. Poulose <suzuki.poulose@arm.com>
---
 arch/arm64/include/asm/kvm_mmu.h |  114 ++++++++++++++++++++------------------
 1 file changed, 61 insertions(+), 53 deletions(-)

Marc Zyngier Oct. 7, 2015, 11:13 a.m. UTC | #1

On 15/09/15 16:41, Suzuki K. Poulose wrote:
> From: "Suzuki K. Poulose" <suzuki.poulose@arm.com>
> 
> The existing fake pgd handling code assumes that the stage-2 entry
> level can only be one level down that of the host, which may not be
> true always(e.g, with the introduction of 16k pagesize).
> 
> e.g.
> With 16k page size and 48bit VA and 40bit IPA we have the following
> split for page table levels:
> 
> level:  0       1         2         3
> bits : [47] [46 - 36] [35 - 25] [24 - 14] [13 - 0]
>          ^       ^     ^
>          |       |     |
>    host entry    |     x---- stage-2 entry
>                  |
>         IPA -----x
> 
> The stage-2 entry level is 2, due to the concatenation of 16tables
> at level 2(mandated by the hardware). So, we need to fake two levels
> to actually reach the hyp page table. This case cannot be handled

Nit: this is the stage-2 PT, not HYP.

> with the existing code, as, all we know about is KVM_PREALLOC_LEVEL
> which kind of stands for two different pieces of information.
> 
> 1) Whether we have fake page table entry levels.
> 2) The entry level of stage-2 translation.
> 
> We loose the information about the number of fake levels that
> we may have to use. Also, KVM_PREALLOC_LEVEL computation itself
> is wrong, as we assume the hw entry level is always 1 level down
> from the host.
> 
> This patch introduces two seperate indicators :

Nit: "separate".

> 1) Accurate entry level for stage-2 translation - HYP_PGTABLE_ENTRY_LEVEL -
>    using the new helpers.

Same confusion here. HYP has its own set of page tables, and this
definitely is S2, not HYP. Please update this symbol (and all the
similar ones) so that it is not confusing.

> 2) Number of levels of fake pagetable entries. (KVM_FAKE_PGTABLE_LEVELS)
> 
> The following conditions hold true for all cases(with 40bit IPA)
> 1) The stage-2 entry level <= 2
> 2) Number of fake page-table entries is in the inclusive range [0, 2].
> 
> Cc: kvmarm@lists.cs.columbia.edu
> Cc: christoffer.dall@linaro.org
> Cc: Marc.Zyngier@arm.com
> Signed-off-by: Suzuki K. Poulose <suzuki.poulose@arm.com>
> ---
>  arch/arm64/include/asm/kvm_mmu.h |  114 ++++++++++++++++++++------------------
>  1 file changed, 61 insertions(+), 53 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 2567fe8..72cfd9e 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -41,18 +41,6 @@
>   */
>  #define TRAMPOLINE_VA		(HYP_PAGE_OFFSET_MASK & PAGE_MASK)
>  
> -/*
> - * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation
> - * levels in addition to the PGD and potentially the PUD which are
> - * pre-allocated (we pre-allocate the fake PGD and the PUD when the Stage-2
> - * tables use one level of tables less than the kernel.
> - */
> -#ifdef CONFIG_ARM64_64K_PAGES
> -#define KVM_MMU_CACHE_MIN_PAGES	1
> -#else
> -#define KVM_MMU_CACHE_MIN_PAGES	2
> -#endif
> -
>  #ifdef __ASSEMBLY__
>  
>  /*
> @@ -80,6 +68,26 @@
>  #define KVM_PHYS_SIZE	(1UL << KVM_PHYS_SHIFT)
>  #define KVM_PHYS_MASK	(KVM_PHYS_SIZE - 1UL)
>  
> +/*
> + * At stage-2 entry level, upto 16 tables can be concatenated and
> + * the hardware expects us to use concatenation, whenever possible.
> + * So, number of page table levels for KVM_PHYS_SHIFT is always
> + * the number of normal page table levels for (KVM_PHYS_SHIFT - 4).
> + */
> +#define HYP_PGTABLE_LEVELS	ARM64_HW_PGTABLE_LEVELS(KVM_PHYS_SHIFT - 4)
> +/* Number of bits normally addressed by HYP_PGTABLE_LEVELS */
> +#define HYP_PGTABLE_SHIFT	ARM64_HW_PGTABLE_LEVEL_SHIFT(HYP_PGTABLE_LEVELS + 1)

Why +1? I don't understand where that is coming from... which makes the
rest of the patch fairly opaque to me...

	M.

Suzuki K Poulose Oct. 7, 2015, 12:21 p.m. UTC | #2

On 07/10/15 12:13, Marc Zyngier wrote:
> On 15/09/15 16:41, Suzuki K. Poulose wrote:
>> From: "Suzuki K. Poulose" <suzuki.poulose@arm.com>
>>
>> The existing fake pgd handling code assumes that the stage-2 entry
>> level can only be one level down that of the host, which may not be
>> true always(e.g, with the introduction of 16k pagesize).
>>
>> e.g.
>> With 16k page size and 48bit VA and 40bit IPA we have the following
>> split for page table levels:
>>
>> level:  0       1         2         3
>> bits : [47] [46 - 36] [35 - 25] [24 - 14] [13 - 0]
>>           ^       ^     ^
>>           |       |     |
>>     host entry    |     x---- stage-2 entry
>>                   |
>>          IPA -----x
>>
>> The stage-2 entry level is 2, due to the concatenation of 16tables
>> at level 2(mandated by the hardware). So, we need to fake two levels
>> to actually reach the hyp page table. This case cannot be handled
>
> Nit: this is the stage-2 PT, not HYP.
>
>> with the existing code, as, all we know about is KVM_PREALLOC_LEVEL
>> which kind of stands for two different pieces of information.
>>
>> 1) Whether we have fake page table entry levels.
>> 2) The entry level of stage-2 translation.
>>
>> We loose the information about the number of fake levels that
>> we may have to use. Also, KVM_PREALLOC_LEVEL computation itself
>> is wrong, as we assume the hw entry level is always 1 level down
>> from the host.
>>
>> This patch introduces two seperate indicators :
>
> Nit: "separate".
>
>> 1) Accurate entry level for stage-2 translation - HYP_PGTABLE_ENTRY_LEVEL -
>>     using the new helpers.
>
> Same confusion here. HYP has its own set of page tables, and this
> definitely is S2, not HYP. Please update this symbol (and all the
> similar ones) so that it is not confusing.
>

Sure, I will use S2 everywhere.

>> 2) Number of levels of fake pagetable entries. (KVM_FAKE_PGTABLE_LEVELS)
>>
>> The following conditions hold true for all cases(with 40bit IPA)
>> 1) The stage-2 entry level <= 2
>> 2) Number of fake page-table entries is in the inclusive range [0, 2].
>>
>> Cc: kvmarm@lists.cs.columbia.edu
>> Cc: christoffer.dall@linaro.org
>> Cc: Marc.Zyngier@arm.com
>> Signed-off-by: Suzuki K. Poulose <suzuki.poulose@arm.com>
>> ---
>>   arch/arm64/include/asm/kvm_mmu.h |  114 ++++++++++++++++++++------------------
>>   1 file changed, 61 insertions(+), 53 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>> index 2567fe8..72cfd9e 100644
>> --- a/arch/arm64/include/asm/kvm_mmu.h
>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>> @@ -41,18 +41,6 @@
>>    */
>>   #define TRAMPOLINE_VA		(HYP_PAGE_OFFSET_MASK & PAGE_MASK)
>>
>> -/*
>> - * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation
>> - * levels in addition to the PGD and potentially the PUD which are
>> - * pre-allocated (we pre-allocate the fake PGD and the PUD when the Stage-2
>> - * tables use one level of tables less than the kernel.
>> - */
>> -#ifdef CONFIG_ARM64_64K_PAGES
>> -#define KVM_MMU_CACHE_MIN_PAGES	1
>> -#else
>> -#define KVM_MMU_CACHE_MIN_PAGES	2
>> -#endif
>> -
>>   #ifdef __ASSEMBLY__
>>
>>   /*
>> @@ -80,6 +68,26 @@
>>   #define KVM_PHYS_SIZE	(1UL << KVM_PHYS_SHIFT)
>>   #define KVM_PHYS_MASK	(KVM_PHYS_SIZE - 1UL)
>>
>> +/*
>> + * At stage-2 entry level, upto 16 tables can be concatenated and
>> + * the hardware expects us to use concatenation, whenever possible.
>> + * So, number of page table levels for KVM_PHYS_SHIFT is always
>> + * the number of normal page table levels for (KVM_PHYS_SHIFT - 4).
>> + */
>> +#define HYP_PGTABLE_LEVELS	ARM64_HW_PGTABLE_LEVELS(KVM_PHYS_SHIFT - 4)
>> +/* Number of bits normally addressed by HYP_PGTABLE_LEVELS */
>> +#define HYP_PGTABLE_SHIFT	ARM64_HW_PGTABLE_LEVEL_SHIFT(HYP_PGTABLE_LEVELS + 1)
>
> Why +1? I don't understand where that is coming from... which makes the
> rest of the patch fairly opaque to me...

Sorry for the confusion in the numbering of levels and the lack of comments.

Taking the above example in the description, with 16K.


ARM ARM entry

no. of
levels     4     3         2         1         0

vabits : [47] [46 - 36] [35 - 25] [24 - 14] [13 - 0]
             ^       ^    ^
             |       |    |
       host entry    |    x---- stage-2 entry
                     |    |
            IPA -----x    x----- HYP_PGTABLE_SHIFT


1) ARM64_HW_PGTABLE_LEVEL_SHIFT(x) gives the size a level 'x' entry can map.

e.g, PTE_SHIFT => ARM64_HW_PGTABLE_LEVEL_SHIFT(1) => PAGE_SHIFT = 14
      PMD_SHIFT => ARM64_HW_PGTABLE_LEVEL_SHIFT(2) => (PAGE_SHIFT - 3) + PAGE_SHIFT = 25
      PUD_SHIFT => ARM64_HW_PGTABLE_LEVEL_SHIFT(3) => 36

and so on.

Now we get HYP_PAGETABLE_LEVELS = 2

To calculate the number of concatenated entries, we need to know the total size(HYP_PGTABLE_SHIFT)
that can be mapped by the hyp(stage2) page table with HYP_PGTABLE_LEVELS(2). It is
nothing but the size mapped by a (HYP_PGTABLE_LEVELS + 1) entry.
i.e, ARM64_HW_PGTABLE_LEVEL_SHIFT(3) = 36 ( = 39 for 4K)

We can use that to calculate the number of concatenated entries, by :

	KVM_PHYS_SHIFT - HYP_PGTABLE_SHIFT

Numbering of the levels is a bit confusing. The ARM ARM numbers levels from the top bits,
while we could end up using the levels in the reverse order. Hence

#define HYP_PGTABLE_ENTRY_LEVEL (4 - HYP_PGTABLE_LEVELS)

could also create confusion. I will get rid of that and just use HYP_PGTABLE_LEVELS.

Thanks
Suzuki







>
> 	M.
>

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoffer Dall Oct. 10, 2015, 2:52 p.m. UTC | #3

Hi Suzuki,

On Tue, Sep 15, 2015 at 04:41:22PM +0100, Suzuki K. Poulose wrote:
> From: "Suzuki K. Poulose" <suzuki.poulose@arm.com>
> 
> The existing fake pgd handling code assumes that the stage-2 entry
> level can only be one level down that of the host, which may not be
> true always(e.g, with the introduction of 16k pagesize).

I had to refresh my mind a fair bit to be able to review this, so I
thought it may be useful to just remind us all what the constraints of
this whole thing is, and make sure we agree on this:

1. We fix the IPA max width to 40 bits
2. We don't support systems with a PARange smaller than 40 bits (do we
   check this anywhere or document this anywhere?)
3. We always assume we are running on a system with PARange of 40 bits
   and we are therefore constrained to use concatination.

As an implication of (3) above, this code will attempt to allocate 256K
of physically contiguous memory for each VM on the system.  That is
probably ok, but I just wanted to point it out in case it raises any
eyebrows for other people following this thread.

> 
> e.g.
> With 16k page size and 48bit VA and 40bit IPA we have the following
> split for page table levels:
> 
> level:  0       1         2         3
> bits : [47] [46 - 36] [35 - 25] [24 - 14] [13 - 0]
>          ^       ^     ^
>          |       |     |
>    host entry    |     x---- stage-2 entry
>                  |
>         IPA -----x

Isn't the stage-2 entry using bits [39:25], because you resolve
more than 11 bits on the initial level of lookup when you concatenate
tables?

> 
> The stage-2 entry level is 2, due to the concatenation of 16tables
> at level 2(mandated by the hardware). So, we need to fake two levels
> to actually reach the hyp page table. This case cannot be handled
> with the existing code, as, all we know about is KVM_PREALLOC_LEVEL
> which kind of stands for two different pieces of information.
> 
> 1) Whether we have fake page table entry levels.
> 2) The entry level of stage-2 translation.
> 
> We loose the information about the number of fake levels that
> we may have to use. Also, KVM_PREALLOC_LEVEL computation itself
> is wrong, as we assume the hw entry level is always 1 level down
> from the host.
> 
> This patch introduces two seperate indicators :
> 1) Accurate entry level for stage-2 translation - HYP_PGTABLE_ENTRY_LEVEL -
>    using the new helpers.
> 2) Number of levels of fake pagetable entries. (KVM_FAKE_PGTABLE_LEVELS)
> 
> The following conditions hold true for all cases(with 40bit IPA)
> 1) The stage-2 entry level <= 2
> 2) Number of fake page-table entries is in the inclusive range [0, 2].

nit: Number of fake levels of page tables

> 
> Cc: kvmarm@lists.cs.columbia.edu
> Cc: christoffer.dall@linaro.org
> Cc: Marc.Zyngier@arm.com
> Signed-off-by: Suzuki K. Poulose <suzuki.poulose@arm.com>
> ---
>  arch/arm64/include/asm/kvm_mmu.h |  114 ++++++++++++++++++++------------------
>  1 file changed, 61 insertions(+), 53 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 2567fe8..72cfd9e 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -41,18 +41,6 @@
>   */
>  #define TRAMPOLINE_VA		(HYP_PAGE_OFFSET_MASK & PAGE_MASK)
>  
> -/*
> - * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation
> - * levels in addition to the PGD and potentially the PUD which are
> - * pre-allocated (we pre-allocate the fake PGD and the PUD when the Stage-2
> - * tables use one level of tables less than the kernel.
> - */
> -#ifdef CONFIG_ARM64_64K_PAGES
> -#define KVM_MMU_CACHE_MIN_PAGES	1
> -#else
> -#define KVM_MMU_CACHE_MIN_PAGES	2
> -#endif
> -
>  #ifdef __ASSEMBLY__
>  
>  /*
> @@ -80,6 +68,26 @@
>  #define KVM_PHYS_SIZE	(1UL << KVM_PHYS_SHIFT)
>  #define KVM_PHYS_MASK	(KVM_PHYS_SIZE - 1UL)
>  
> +/*
> + * At stage-2 entry level, upto 16 tables can be concatenated and

nit: Can you rewrite the first part of this comment to be in line with
the ARM ARM, such as: "The stage-2 page tables can concatenate up to 16
tables at the inital level"  ?


> + * the hardware expects us to use concatenation, whenever possible.

I think the 'hardware expects us' is a bit vague.  At least I find this
whole part of the architecture incredibly confusing already, so it would
help me in the future if we put something like:

"The hardware requires that we use concatenation depending on the
supported PARange and page size.  We always assume the hardware's PASize
is maximum 40 bits in this context, and with a fixed IPA width of 40
bits, we concatenate 2 tables for 4K pages, 16 tables for 16K pages, and
do not use concatenation for 64K pages."

Did I get this right?

> + * So, number of page table levels for KVM_PHYS_SHIFT is always
> + * the number of normal page table levels for (KVM_PHYS_SHIFT - 4).
> + */
> +#define HYP_PGTABLE_LEVELS	ARM64_HW_PGTABLE_LEVELS(KVM_PHYS_SHIFT - 4)

I see the math lines up, but I don't think it's intuitive, as I don't
understand why it's obvious that it's the 'normal' page table for
KVM_PHYS_SHIFT - 4.

I see this as an architectural limitation given in the ARM ARM, and we
should just refer to that, and do:

#if PAGE_SHIFT == 12
#define S2_PGTABLE_LEVELS	3
#else
#define S2_PGTABLE_LEVELS	2
#endif

> +/* Number of bits normally addressed by HYP_PGTABLE_LEVELS */
> +#define HYP_PGTABLE_SHIFT	ARM64_HW_PGTABLE_LEVEL_SHIFT(HYP_PGTABLE_LEVELS + 1)
> +#define HYP_PGDIR_SHIFT		ARM64_HW_PGTABLE_LEVEL_SHIFT(HYP_PGTABLE_LEVELS)
> +#define HYP_PGTABLE_ENTRY_LEVEL	(4 - HYP_PGTABLE_LEVELS)

We are introducing a huge number of defines here, which are all more or
less opaque to anyone coming back to this code.

I may be extraordinarily stupid, but I really need each define explained
in a comment to be able to follow this code (those above and the
S2_ENTRY_TABLES below).

> +/*
> + * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation
> + * levels in addition to the PGD and potentially the PUD which are
> + * pre-allocated (we pre-allocate the fake PGD and the PUD when the Stage-2
> + * tables use one level of tables less than the kernel.
> + */
> +#define KVM_MMU_CACHE_MIN_PAGES	(HYP_PGTABLE_LEVELS - 1)
> +
>  int create_hyp_mappings(void *from, void *to);
>  int create_hyp_io_mappings(void *from, void *to, phys_addr_t);
>  void free_boot_hyp_pgd(void);
> @@ -145,56 +153,41 @@ static inline bool kvm_s2pmd_readonly(pmd_t *pmd)
>  #define kvm_pud_addr_end(addr, end)	pud_addr_end(addr, end)
>  #define kvm_pmd_addr_end(addr, end)	pmd_addr_end(addr, end)
>  
> -/*
> - * In the case where PGDIR_SHIFT is larger than KVM_PHYS_SHIFT, we can address
> - * the entire IPA input range with a single pgd entry, and we would only need
> - * one pgd entry.  Note that in this case, the pgd is actually not used by
> - * the MMU for Stage-2 translations, but is merely a fake pgd used as a data
> - * structure for the kernel pgtable macros to work.
> - */
> -#if PGDIR_SHIFT > KVM_PHYS_SHIFT
> -#define PTRS_PER_S2_PGD_SHIFT	0
> +/* Number of concatenated tables in stage-2 entry level */
> +#if KVM_PHYS_SHIFT > HYP_PGTABLE_SHIFT
> +#define S2_ENTRY_TABLES_SHIFT	(KVM_PHYS_SHIFT - HYP_PGTABLE_SHIFT)
>  #else
> -#define PTRS_PER_S2_PGD_SHIFT	(KVM_PHYS_SHIFT - PGDIR_SHIFT)
> +#define S2_ENTRY_TABLES_SHIFT	0
>  #endif
> +#define S2_ENTRY_TABLES		(1 << (S2_ENTRY_TABLES_SHIFT))
> +
> +/* Number of page table levels we fake to reach the hw pgtable for hyp */
> +#define KVM_FAKE_PGTABLE_LEVELS	(CONFIG_PGTABLE_LEVELS - HYP_PGTABLE_LEVELS)
> +
> +#define PTRS_PER_S2_PGD_SHIFT	(KVM_PHYS_SHIFT - HYP_PGDIR_SHIFT)
>  #define PTRS_PER_S2_PGD		(1 << PTRS_PER_S2_PGD_SHIFT)
>  #define S2_PGD_ORDER		get_order(PTRS_PER_S2_PGD * sizeof(pgd_t))
>  
>  #define kvm_pgd_index(addr)	(((addr) >> PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
>  
> -/*
> - * If we are concatenating first level stage-2 page tables, we would have less
> - * than or equal to 16 pointers in the fake PGD, because that's what the
> - * architecture allows.  In this case, (4 - CONFIG_PGTABLE_LEVELS)
> - * represents the first level for the host, and we add 1 to go to the next
> - * level (which uses contatenation) for the stage-2 tables.
> - */
> -#if PTRS_PER_S2_PGD <= 16
> -#define KVM_PREALLOC_LEVEL	(4 - CONFIG_PGTABLE_LEVELS + 1)
> -#else
> -#define KVM_PREALLOC_LEVEL	(0)
> -#endif
> -
>  static inline void *kvm_get_hwpgd(struct kvm *kvm)
>  {
>  	pgd_t *pgd = kvm->arch.pgd;
>  	pud_t *pud;
>  
> -	if (KVM_PREALLOC_LEVEL == 0)
> +	if (KVM_FAKE_PGTABLE_LEVELS == 0)
>  		return pgd;
>  
>  	pud = pud_offset(pgd, 0);
> -	if (KVM_PREALLOC_LEVEL == 1)
> +	if (HYP_PGTABLE_ENTRY_LEVEL == 1)
>  		return pud;
>  
> -	BUG_ON(KVM_PREALLOC_LEVEL != 2);
> +	BUG_ON(HYP_PGTABLE_ENTRY_LEVEL != 2);
>  	return pmd_offset(pud, 0);
>  }
>  
>  static inline unsigned int kvm_get_hwpgd_size(void)
>  {
> -	if (KVM_PREALLOC_LEVEL > 0)
> -		return PTRS_PER_S2_PGD * PAGE_SIZE;
>  	return PTRS_PER_S2_PGD * sizeof(pgd_t);
>  }
>  
> @@ -207,27 +200,38 @@ static inline pgd_t* kvm_setup_fake_pgd(pgd_t *hwpgd)
>  {
>  	int i;
>  	pgd_t *pgd;
> +	pud_t *pud;
>  
> -	if (!KVM_PREALLOC_LEVEL)
> +	if (KVM_FAKE_PGTABLE_LEVELS == 0)
>  		return hwpgd;
> -	/*
> -	 * When KVM_PREALLOC_LEVEL==2, we allocate a single page for
> -	 * the PMD and the kernel will use folded pud.
> -	 * When KVM_PREALLOC_LEVEL==1, we allocate 2 consecutive PUD
> -	 * pages.
> -	 */
> +
>  	pgd = kmalloc(PTRS_PER_S2_PGD * sizeof(pgd_t),
>  			GFP_KERNEL | __GFP_ZERO);
>  
>  	if (!pgd)
>  		return ERR_PTR(-ENOMEM);
> +	/*
> +	 * If the stage-2 entry is two level down from that of the host,
> +	 * we are using a 4-level table on host (since HYP_PGTABLE_ENTRY_LEVEL
> +	 * cannot be < 2. So, this implies we need to allocat a PUD table
> +	 * to map the concatenated PMD tables.
> +	 */
> +	if (KVM_FAKE_PGTABLE_LEVELS == 2) {
> +		pud = (pud_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, 0);
> +		if (!pud) {
> +			kfree(pgd);
> +			return ERR_PTR(-ENOMEM);
> +		}
> +		/* plug the pud into the PGD */
> +		pgd_populate(NULL, pgd, pud);
> +	}
>  
>  	/* Plug the HW PGD into the fake one. */
> -	for (i = 0; i < PTRS_PER_S2_PGD; i++) {
> -		if (KVM_PREALLOC_LEVEL == 1)
> +	for (i = 0; i < S2_ENTRY_TABLES; i++) {
> +		if (HYP_PGTABLE_ENTRY_LEVEL == 1)
>  			pgd_populate(NULL, pgd + i,
>  				     (pud_t *)hwpgd + i * PTRS_PER_PUD);
> -		else if (KVM_PREALLOC_LEVEL == 2)
> +		else if (HYP_PGTABLE_ENTRY_LEVEL == 2)
>  			pud_populate(NULL, pud_offset(pgd, 0) + i,
>  				     (pmd_t *)hwpgd + i * PTRS_PER_PMD);
>  	}
> @@ -237,8 +241,12 @@ static inline pgd_t* kvm_setup_fake_pgd(pgd_t *hwpgd)
>  
>  static inline void kvm_free_fake_pgd(pgd_t *pgd)
>  {
> -	if (KVM_PREALLOC_LEVEL > 0)
> +	if (KVM_FAKE_PGTABLE_LEVELS > 0) {
> +		/* free the PUD table */
> +		if (KVM_FAKE_PGTABLE_LEVELS == 2)
> +			free_page((unsigned long)pud_offset(pgd, 0));
>  		kfree(pgd);
> +	}
>  }
>  
>  static inline bool kvm_page_empty(void *ptr)
> @@ -253,14 +261,14 @@ static inline bool kvm_page_empty(void *ptr)
>  #define kvm_pmd_table_empty(kvm, pmdp) (0)
>  #else
>  #define kvm_pmd_table_empty(kvm, pmdp) \
> -	(kvm_page_empty(pmdp) && (!(kvm) || KVM_PREALLOC_LEVEL < 2))
> +	(kvm_page_empty(pmdp) && (!(kvm) || HYP_PGTABLE_ENTRY_LEVEL < 2))
>  #endif
>  
>  #ifdef __PAGETABLE_PUD_FOLDED
>  #define kvm_pud_table_empty(kvm, pudp) (0)
>  #else
>  #define kvm_pud_table_empty(kvm, pudp) \
> -	(kvm_page_empty(pudp) && (!(kvm) || KVM_PREALLOC_LEVEL < 1))
> +	(kvm_page_empty(pudp) && (!(kvm) || HYP_PGTABLE_ENTRY_LEVEL < 1))
>  #endif
>  
>  
> -- 
> 1.7.9.5
> 

I actually wonder from looking at this whole patch if we even want to go
here.  Maybe this is really the time to say that we should get rid of
the dependency between the host page table layout and the stage-2 page
table layout.

Since the rest of this series looks pretty good, I'm wondering if you
should just disable KVM in the config system if 16K pages is selected,
and then you can move ahead with this series while we fix KVM properly?

[Sorry for taking so long to get around to looking at this].

Thanks,
-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Suzuki K Poulose Oct. 12, 2015, 9:55 a.m. UTC | #4

On 10/10/15 15:52, Christoffer Dall wrote:
> Hi Suzuki,

Hi Christoffer,

Thanks for being patient enough to review the code :-) without much of
the comments. I now realise there needs much more documentation than
what I have put in already. I am taking care of this in the next
revision already.

> I had to refresh my mind a fair bit to be able to review this, so I
> thought it may be useful to just remind us all what the constraints of
> this whole thing is, and make sure we agree on this:
>
> 1. We fix the IPA max width to 40 bits
> 2. We don't support systems with a PARange smaller than 40 bits (do we
>     check this anywhere or document this anywhere?)

AFAIT, no we don't check it anywhere. May be we should. We could plug this
into my CPU feature infrastructure[1] and let the is_hype_mode_available()
use the info to decide if we can support 40bit IPA ?

> 3. We always assume we are running on a system with PARange of 40 bits
>     and we are therefore constrained to use concatination.
>
> As an implication of (3) above, this code will attempt to allocate 256K
> of physically contiguous memory for each VM on the system.  That is
> probably ok, but I just wanted to point it out in case it raises any
> eyebrows for other people following this thread.

Right, I will document this in a comment.

>> level:  0       1         2         3
>> bits : [47] [46 - 36] [35 - 25] [24 - 14] [13 - 0]
>>           ^       ^     ^
>>           |       |     |
>>     host entry    |     x---- stage-2 entry
>>                   |
>>          IPA -----x
>
> Isn't the stage-2 entry using bits [39:25], because you resolve
> more than 11 bits on the initial level of lookup when you concatenate
> tables?

Yes, the stage-2 entry is just supposed to show the entry level (2).

>>
>> The following conditions hold true for all cases(with 40bit IPA)
>> 1) The stage-2 entry level <= 2
>> 2) Number of fake page-table entries is in the inclusive range [0, 2].
>
> nit: Number of fake levels of page tables

Correct, I have fixed it already.


>> +/*
>> + * At stage-2 entry level, upto 16 tables can be concatenated and
>
> nit: Can you rewrite the first part of this comment to be in line with
> the ARM ARM, such as: "The stage-2 page tables can concatenate up to 16
> tables at the inital level"  ?

Yes, will do it.

>
>
>> + * the hardware expects us to use concatenation, whenever possible.
>
> I think the 'hardware expects us' is a bit vague.  At least I find this
> whole part of the architecture incredibly confusing already, so it would
> help me in the future if we put something like:
>
> "The hardware requires that we use concatenation depending on the
> supported PARange and page size.  We always assume the hardware's PASize
> is maximum 40 bits in this context, and with a fixed IPA width of 40
> bits, we concatenate 2 tables for 4K pages, 16 tables for 16K pages, and
> do not use concatenation for 64K pages."
>
> Did I get this right?

You are right. The rule is simple. Upto 16 tables can be concatenated at
the stage-2 entry level.

>
>> + * So, number of page table levels for KVM_PHYS_SHIFT is always
>> + * the number of normal page table levels for (KVM_PHYS_SHIFT - 4).
>> + */
>> +#define HYP_PGTABLE_LEVELS	ARM64_HW_PGTABLE_LEVELS(KVM_PHYS_SHIFT - 4)
>
> I see the math lines up, but I don't think it's intuitive, as I don't
> understand why it's obvious that it's the 'normal' page table for
> KVM_PHYS_SHIFT - 4.

Because, we can concatenate upto 16 page table entries. With the current
set of page sizes the above 'magic' formula works out. But yes, the following
suggestion makes more sense.

>
> I see this as an architectural limitation given in the ARM ARM, and we
> should just refer to that, and do:
>
> #if PAGE_SHIFT == 12
> #define S2_PGTABLE_LEVELS	3
> #else
> #define S2_PGTABLE_LEVELS	2
> #endif

OK, we could do that.

>
>> +/* Number of bits normally addressed by HYP_PGTABLE_LEVELS */
>> +#define HYP_PGTABLE_SHIFT	ARM64_HW_PGTABLE_LEVEL_SHIFT(HYP_PGTABLE_LEVELS + 1)
>> +#define HYP_PGDIR_SHIFT		ARM64_HW_PGTABLE_LEVEL_SHIFT(HYP_PGTABLE_LEVELS)
>> +#define HYP_PGTABLE_ENTRY_LEVEL	(4 - HYP_PGTABLE_LEVELS)
>
> We are introducing a huge number of defines here, which are all more or
> less opaque to anyone coming back to this code.
>
> I may be extraordinarily stupid, but I really need each define explained
> in a comment to be able to follow this code (those above and the
> S2_ENTRY_TABLES below).

No, you right. I need to document all the above properly, which I is something
I am in the middle of.

>
> I actually wonder from looking at this whole patch if we even want to go
> here.  Maybe this is really the time to say that we should get rid of
> the dependency between the host page table layout and the stage-2 page
> table layout.
>
> Since the rest of this series looks pretty good, I'm wondering if you
> should just disable KVM in the config system if 16K pages is selected,
> and then you can move ahead with this series while we fix KVM properly?

I can send an updated version (which is in the test furnace) soon, so that
you can take a look ?

Suzuki

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoffer Dall Oct. 13, 2015, 3:39 p.m. UTC | #5

On Mon, Oct 12, 2015 at 10:55:24AM +0100, Suzuki K. Poulose wrote:
> On 10/10/15 15:52, Christoffer Dall wrote:
> >Hi Suzuki,
> 
> Hi Christoffer,
> 
> Thanks for being patient enough to review the code :-) without much of
> the comments. I now realise there needs much more documentation than
> what I have put in already. I am taking care of this in the next
> revision already.
> 
> >I had to refresh my mind a fair bit to be able to review this, so I
> >thought it may be useful to just remind us all what the constraints of
> >this whole thing is, and make sure we agree on this:
> >
> >1. We fix the IPA max width to 40 bits
> >2. We don't support systems with a PARange smaller than 40 bits (do we
> >    check this anywhere or document this anywhere?)
> 
> AFAIT, no we don't check it anywhere. May be we should. We could plug this
> into my CPU feature infrastructure[1] and let the is_hype_mode_available()
> use the info to decide if we can support 40bit IPA ?
> 

If we support 40bit IPA or more, yes, I think that would be sane.  Or at
least put a comment somewhere, perhaps in Documenation.

> >3. We always assume we are running on a system with PARange of 40 bits
> >    and we are therefore constrained to use concatination.
> >
> >As an implication of (3) above, this code will attempt to allocate 256K
> >of physically contiguous memory for each VM on the system.  That is
> >probably ok, but I just wanted to point it out in case it raises any
> >eyebrows for other people following this thread.
> 
> Right, I will document this in a comment.
> 
> >>level:  0       1         2         3
> >>bits : [47] [46 - 36] [35 - 25] [24 - 14] [13 - 0]
> >>          ^       ^     ^
> >>          |       |     |
> >>    host entry    |     x---- stage-2 entry
> >>                  |
> >>         IPA -----x
> >
> >Isn't the stage-2 entry using bits [39:25], because you resolve
> >more than 11 bits on the initial level of lookup when you concatenate
> >tables?
> 
> Yes, the stage-2 entry is just supposed to show the entry level (2).
> 

I don't understand, the stage-2 entry level will be at bit 39, not 35?

Thanks,
-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Suzuki K Poulose Oct. 13, 2015, 4:04 p.m. UTC | #6

On 13/10/15 16:39, Christoffer Dall wrote:
> On Mon, Oct 12, 2015 at 10:55:24AM +0100, Suzuki K. Poulose wrote:
>> On 10/10/15 15:52, Christoffer Dall wrote:
>>> Hi Suzuki,
>>
>> Hi Christoffer,
>>
>> Thanks for being patient enough to review the code :-) without much of
>> the comments. I now realise there needs much more documentation than
>> what I have put in already. I am taking care of this in the next
>> revision already.
>>
>>> I had to refresh my mind a fair bit to be able to review this, so I
>>> thought it may be useful to just remind us all what the constraints of
>>> this whole thing is, and make sure we agree on this:
>>>
>>> 1. We fix the IPA max width to 40 bits
>>> 2. We don't support systems with a PARange smaller than 40 bits (do we
>>>     check this anywhere or document this anywhere?)
>>
>> AFAIT, no we don't check it anywhere. May be we should. We could plug this
>> into my CPU feature infrastructure[1] and let the is_hype_mode_available()
>> use the info to decide if we can support 40bit IPA ?
>>
>
> If we support 40bit IPA or more, yes, I think that would be sane.  Or at
> least put a comment somewhere, perhaps in Documenation.

OK

>>> 3. We always assume we are running on a system with PARange of 40 bits
>>>     and we are therefore constrained to use concatination.
>>>
>>> As an implication of (3) above, this code will attempt to allocate 256K
>>> of physically contiguous memory for each VM on the system.  That is
>>> probably ok, but I just wanted to point it out in case it raises any
>>> eyebrows for other people following this thread.
>>
>> Right, I will document this in a comment.
>>
>>>> level:  0       1         2         3
>>>> bits : [47] [46 - 36] [35 - 25] [24 - 14] [13 - 0]
>>>>           ^       ^     ^
>>>>           |       |     |
>>>>     host entry    |     x---- stage-2 entry
>>>>                   |
>>>>          IPA -----x
>>>
>>> Isn't the stage-2 entry using bits [39:25], because you resolve
>>> more than 11 bits on the initial level of lookup when you concatenate
>>> tables?
>>
>> Yes, the stage-2 entry is just supposed to show the entry level (2).
>>
>
> I don't understand, the stage-2 entry level will be at bit 39, not 35?
>

That picture shows the 'level 2' at which the stage-2 translations begin,
with 16 pages concatenated, which gives 39-25. The host kernel macros,
normally only sees upto bit 35, which is fixed using the kvm_pgd_index()
to pick the right PGD entry for a VA.

Thanks
Suzuki

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[13/15] arm64: kvm: Rewrite fake pgd handling

Commit Message

Comments

Patch