diff mbox series

[RFC,2/3] KVM: arm64: Fix handling of merging tables into a block entry

Message ID 20201130121847.91808-3-wangyanan55@huawei.com (mailing list archive)
State New, archived
Headers show
Series Fix several bugs in KVM stage 2 translation | expand

Commit Message

Yanan Wang Nov. 30, 2020, 12:18 p.m. UTC
In the dirty logging case (logging_active == True), we need to collapse a block
entry into a table if necessary. After dirty logging is canceled, when merging
tables back into a block entry, we should not only free the non-huge page
tables but also unmap the non-huge mappings for the block. Without the unmap,
inconsistent TLB entries for the pages in the block will be created.

We could also use the unmap_stage2_range API to unmap the non-huge mappings,
but this could potentially free the upper-level page-table page, which will
still be useful later.

Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
---
 arch/arm64/kvm/hyp/pgtable.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

Comments

Will Deacon Nov. 30, 2020, 1:34 p.m. UTC | #1
On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:
> In dirty logging case(logging_active == True), we need to collapse a block
> entry into a table if necessary. After dirty logging is canceled, when merging
> tables back into a block entry, we should not only free the non-huge page
> tables but also unmap the non-huge mapping for the block. Without the unmap,
> inconsistent TLB entries for the pages in the the block will be created.
> 
> We could also use unmap_stage2_range API to unmap the non-huge mapping,
> but this could potentially free the upper level page-table page which
> will be useful later.
> 
> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
> ---
>  arch/arm64/kvm/hyp/pgtable.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 696b6aa83faf..fec8dc9f2baa 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>  	return 0;
>  }
>  
> +static void stage2_flush_dcache(void *addr, u64 size);
> +static bool stage2_pte_cacheable(kvm_pte_t pte);
> +
>  static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>  				struct stage2_map_data *data)
>  {
> @@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>  	struct page *page = virt_to_page(ptep);
>  
>  	if (data->anchor) {
> -		if (kvm_pte_valid(pte))
> +		if (kvm_pte_valid(pte)) {
> +			kvm_set_invalid_pte(ptep);
> +			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
> +				     addr, level);
>  			put_page(page);

This doesn't make sense to me: the page-table pages we're walking when the
anchor is set are not accessible to the hardware walker because we unhooked
the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
TLB invalidation.

Are you seeing a problem in practice here?

> +			if (stage2_pte_cacheable(pte))
> +				stage2_flush_dcache(kvm_pte_follow(pte),
> +						    kvm_granule_size(level));

I don't understand the need for the flush either, as we're just coalescing
existing entries into a larger block mapping.

Will
Yanan Wang Nov. 30, 2020, 3:24 p.m. UTC | #2
On 2020/11/30 21:34, Will Deacon wrote:
> On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:
>> In dirty logging case(logging_active == True), we need to collapse a block
>> entry into a table if necessary. After dirty logging is canceled, when merging
>> tables back into a block entry, we should not only free the non-huge page
>> tables but also unmap the non-huge mapping for the block. Without the unmap,
>> inconsistent TLB entries for the pages in the the block will be created.
>>
>> We could also use unmap_stage2_range API to unmap the non-huge mapping,
>> but this could potentially free the upper level page-table page which
>> will be useful later.
>>
>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>> ---
>>   arch/arm64/kvm/hyp/pgtable.c | 15 +++++++++++++--
>>   1 file changed, 13 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>> index 696b6aa83faf..fec8dc9f2baa 100644
>> --- a/arch/arm64/kvm/hyp/pgtable.c
>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>> @@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>>   	return 0;
>>   }
>>   
>> +static void stage2_flush_dcache(void *addr, u64 size);
>> +static bool stage2_pte_cacheable(kvm_pte_t pte);
>> +
>>   static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>   				struct stage2_map_data *data)
>>   {
>> @@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>   	struct page *page = virt_to_page(ptep);
>>   
>>   	if (data->anchor) {
>> -		if (kvm_pte_valid(pte))
>> +		if (kvm_pte_valid(pte)) {
>> +			kvm_set_invalid_pte(ptep);
>> +			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
>> +				     addr, level);
>>   			put_page(page);
> This doesn't make sense to me: the page-table pages we're walking when the
> anchor is set are not accessible to the hardware walker because we unhooked
> the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
> TLB invalidation.
>
> Are you seeing a problem in practice here?

Yes, I did find a problem in practice.

When the migration was cancelled, a TLB conflict abort was observed in the guest.

This problem was fixed before the rework of the page-table code; you can have
a look at the following two links:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c3736cd32bf5197aed1410ae826d2d254a5b277

https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035031.html

>> +			if (stage2_pte_cacheable(pte))
>> +				stage2_flush_dcache(kvm_pte_follow(pte),
>> +						    kvm_granule_size(level));
> I don't understand the need for the flush either, as we're just coalescing
> existing entries into a larger block mapping.

In my opinion, after unmapping, it is necessary to ensure cache coherency,
because theoretically it is unknown whether the memory attributes of the
future mapping will change or not (e.g. cacheable -> non-cacheable).

> Will
> .
Will Deacon Nov. 30, 2020, 4:01 p.m. UTC | #3
Hi,

Cheers for the quick reply. See below for more questions...

On Mon, Nov 30, 2020 at 11:24:19PM +0800, wangyanan (Y) wrote:
> On 2020/11/30 21:34, Will Deacon wrote:
> > On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:
> > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > index 696b6aa83faf..fec8dc9f2baa 100644
> > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > @@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
> > >   	return 0;
> > >   }
> > > +static void stage2_flush_dcache(void *addr, u64 size);
> > > +static bool stage2_pte_cacheable(kvm_pte_t pte);
> > > +
> > >   static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> > >   				struct stage2_map_data *data)
> > >   {
> > > @@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> > >   	struct page *page = virt_to_page(ptep);
> > >   	if (data->anchor) {
> > > -		if (kvm_pte_valid(pte))
> > > +		if (kvm_pte_valid(pte)) {
> > > +			kvm_set_invalid_pte(ptep);
> > > +			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
> > > +				     addr, level);
> > >   			put_page(page);
> > This doesn't make sense to me: the page-table pages we're walking when the
> > anchor is set are not accessible to the hardware walker because we unhooked
> > the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
> > TLB invalidation.
> > 
> > Are you seeing a problem in practice here?
> 
> Yes, I indeed find a problem in practice.
> 
> When the migration was cancelled, a TLB conflic abort  was found in guest.
> 
> This problem is fixed before rework of the page table code, you can have a
> look in the following two links:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c3736cd32bf5197aed1410ae826d2d254a5b277
> 
> https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035031.html

Ok, let's go through this, because I still don't see the bug. Please correct
me if you spot any mistakes:

  1. We have a block mapping for X => Y
  2. Dirty logging is enabled, so the block mapping is write-protected and
     ends up being split into page mappings
  3. Dirty logging is disabled due to a failed migration.

--- At this point, I think we agree that the state of the MMU is alright ---

  4. We take a stage-2 fault and want to reinstall the block mapping:

     a. kvm_pgtable_stage2_map() is invoked to install the block mapping
     b. stage2_map_walk_table_pre() finds a table where we would like to
        install the block:

	i.   The anchor is set to point at this entry
	ii.  The entry is made invalid
	iii. We invalidate the TLB for the input address. This is
	     TLBI IPAS2SE1IS without level hint and then TLBI VMALLE1IS.

	*** At this point, the page-table pointed to by the old table entry
	    is not reachable to the hardware walker ***

     c. stage2_map_walk_leaf() is called for each leaf entry in the
        now-unreachable subtree, dropping page-references for each valid
	entry it finds.
     d. stage2_map_walk_table_post() is eventually called for the entry
        which we cleared back in b.ii, so we install the new block mapping.

You are proposing to add additional TLB invalidation to (c), but I don't
think that is necessary, thanks to the invalidation already performed in
b.iii. What am I missing here?
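
For reference, here is a simplified sketch of the pre-order handler described
in step (b), reconstructed from the diff hunks quoted in this thread (the guard
checks are elided and the exact signature is assumed to match the leaf handler):

static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
				     kvm_pte_t *ptep,
				     struct stage2_map_data *data)
{
	/* ... checks that a block mapping is wanted/possible are elided ... */

	/* b.ii: invalidate the table entry so the walker can no longer follow it */
	kvm_set_invalid_pte(ptep);

	/*
	 * b.iii: invalidate the TLB for the input address only, without a
	 * level hint -- the single TLBI under discussion.
	 */
	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);

	/* b.i: remember this entry so the _post handler can install the block */
	data->anchor = ptep;
	return 0;
}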

> > > +			if (stage2_pte_cacheable(pte))
> > > +				stage2_flush_dcache(kvm_pte_follow(pte),
> > > +						    kvm_granule_size(level));
> > I don't understand the need for the flush either, as we're just coalescing
> > existing entries into a larger block mapping.
> 
> In my opinion, after unmapping, it is necessary to ensure the cache
> coherency, because it is unknown whether the future mapping memory attribute
> is changed or not (cacheable -> non_cacheable) theoretically.

But in this case we're just changing the structure of the page-tables,
not the pages which are mapped, are we?

Will
Yanan Wang Dec. 1, 2020, 2:30 a.m. UTC | #4
Hi Will,

On 2020/12/1 0:01, Will Deacon wrote:
> Hi,
>
> Cheers for the quick reply. See below for more questions...
>
> On Mon, Nov 30, 2020 at 11:24:19PM +0800, wangyanan (Y) wrote:
>> On 2020/11/30 21:34, Will Deacon wrote:
>>> On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:
>>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>>>> index 696b6aa83faf..fec8dc9f2baa 100644
>>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>>> @@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>>>>    	return 0;
>>>>    }
>>>> +static void stage2_flush_dcache(void *addr, u64 size);
>>>> +static bool stage2_pte_cacheable(kvm_pte_t pte);
>>>> +
>>>>    static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>>>    				struct stage2_map_data *data)
>>>>    {
>>>> @@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>>>    	struct page *page = virt_to_page(ptep);
>>>>    	if (data->anchor) {
>>>> -		if (kvm_pte_valid(pte))
>>>> +		if (kvm_pte_valid(pte)) {
>>>> +			kvm_set_invalid_pte(ptep);
>>>> +			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
>>>> +				     addr, level);
>>>>    			put_page(page);
>>> This doesn't make sense to me: the page-table pages we're walking when the
>>> anchor is set are not accessible to the hardware walker because we unhooked
>>> the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
>>> TLB invalidation.
>>>
>>> Are you seeing a problem in practice here?
>> Yes, I indeed find a problem in practice.
>>
>> When the migration was cancelled, a TLB conflic abort  was found in guest.
>>
>> This problem is fixed before rework of the page table code, you can have a
>> look in the following two links:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c3736cd32bf5197aed1410ae826d2d254a5b277
>>
>> https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035031.html
> Ok, let's go through this, because I still don't see the bug. Please correct
> me if you spot any mistakes:
>
>    1. We have a block mapping for X => Y
>    2. Dirty logging is enabled, so the block mapping is write-protected and
>       ends up being split into page mappings
>    3. Dirty logging is disabled due to a failed migration.
>
> --- At this point, I think we agree that the state of the MMU is alright ---
>
>    4. We take a stage-2 fault and want to reinstall the block mapping:
>
>       a. kvm_pgtable_stage2_map() is invoked to install the block mapping
>       b. stage2_map_walk_table_pre() finds a table where we would like to
>          install the block:
>
> 	i.   The anchor is set to point at this entry
> 	ii.  The entry is made invalid
> 	iii. We invalidate the TLB for the input address. This is
> 	     TLBI IPAS2SE1IS without level hint and then TLBI VMALLE1IS.
>
> 	*** At this point, the page-table pointed to by the old table entry
> 	    is not reachable to the hardware walker ***
>
>       c. stage2_map_walk_leaf() is called for each leaf entry in the
>          now-unreachable subtree, dropping page-references for each valid
> 	entry it finds.
>       d. stage2_map_walk_table_post() is eventually called for the entry
>          which we cleared back in b.ii, so we install the new block mapping.
>
> You are proposing to add additional TLB invalidation to (c), but I don't
> think that is necessary, thanks to the invalidation already performed in
> b.iii. What am I missing here?

The point is that the TLBI at b.iii is not enough. There are many page
mappings that we need to merge into a block mapping.

At b.iii we invalidate the TLB for the input address without a level hint,
but this operation only flushes the TLB for one page mapping; there are
still TLB entries for the other page mappings in the cache, and the MMU
hardware walker can still hit these entries next time.

Maybe we can imagine a concrete example here. Suppose we need to merge page
mappings into a 1G block mapping, and the fault_ipa passed to
user_mem_abort() is 0x225043000. After alignment to 1G, the input address
will be 0x200000000, so the TLBI operation at b.iii will only invalidate the
TLB entry for address 0x200000000. But what about addresses 0x200001000,
0x200002000, ...?

After the 1G block mapping has been installed, when the fault_ipa is
0x200007000, the MMU can still hit the stale TLB entry for the old page
mapping while it can also access memory through the new block entry.

So the TLBI operation added in stage2_map_walk_leaf() aims to invalidate the
TLB entries for all the page mappings that are about to be merged. In this
way, after the block mapping is installed, the MMU can only access memory
through the new block entry.
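
To make the arithmetic in this example concrete, here is a small,
self-contained illustration (not kernel code; it only assumes 4K pages and
prints the addresses involved):

#include <stdint.h>
#include <stdio.h>

#define SZ_1G		(1ULL << 30)	/* 1G block size */
#define PAGE_SIZE	(1ULL << 12)	/* assuming 4K pages */

int main(void)
{
	uint64_t fault_ipa = 0x225043000ULL;
	/* Input address for the block mapping: fault_ipa aligned down to 1G */
	uint64_t block_ipa = fault_ipa & ~(SZ_1G - 1);

	/* The level-hint-less TLBI at b.iii only targets block_ipa */
	printf("TLBI issued for IPA        : 0x%llx\n",
	       (unsigned long long)block_ipa);			/* 0x200000000 */
	/* Any other page mapping in the 1G range may keep its TLB entry */
	printf("first page not invalidated : 0x%llx\n",
	       (unsigned long long)(block_ipa + PAGE_SIZE));	/* 0x200001000 */
	printf("page mappings left cached  : %llu\n",
	       (unsigned long long)(SZ_1G / PAGE_SIZE - 1));	/* 262143 */
	return 0;
}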

>>>> +			if (stage2_pte_cacheable(pte))
>>>> +				stage2_flush_dcache(kvm_pte_follow(pte),
>>>> +						    kvm_granule_size(level));
>>> I don't understand the need for the flush either, as we're just coalescing
>>> existing entries into a larger block mapping.
>> In my opinion, after unmapping, it is necessary to ensure the cache
>> coherency, because it is unknown whether the future mapping memory attribute
>> is changed or not (cacheable -> non_cacheable) theoretically.
> But in this case we're just changing the structure of the page-tables,
> not the pages which are mapped, are we?
>
> Will
> .

Yes, there is not a condition for now that cache attributes will be 
changed in this case.

Maybe this part of dcache flush can be cut.


Yanan
Will Deacon Dec. 1, 2020, 1:46 p.m. UTC | #5
On Tue, Dec 01, 2020 at 10:30:41AM +0800, wangyanan (Y) wrote:
> On 2020/12/1 0:01, Will Deacon wrote:
> > On Mon, Nov 30, 2020 at 11:24:19PM +0800, wangyanan (Y) wrote:
> > > On 2020/11/30 21:34, Will Deacon wrote:
> > > > On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:
> > > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > > index 696b6aa83faf..fec8dc9f2baa 100644
> > > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > > @@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
> > > > >    	return 0;
> > > > >    }
> > > > > +static void stage2_flush_dcache(void *addr, u64 size);
> > > > > +static bool stage2_pte_cacheable(kvm_pte_t pte);
> > > > > +
> > > > >    static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> > > > >    				struct stage2_map_data *data)
> > > > >    {
> > > > > @@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> > > > >    	struct page *page = virt_to_page(ptep);
> > > > >    	if (data->anchor) {
> > > > > -		if (kvm_pte_valid(pte))
> > > > > +		if (kvm_pte_valid(pte)) {
> > > > > +			kvm_set_invalid_pte(ptep);
> > > > > +			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
> > > > > +				     addr, level);
> > > > >    			put_page(page);
> > > > This doesn't make sense to me: the page-table pages we're walking when the
> > > > anchor is set are not accessible to the hardware walker because we unhooked
> > > > the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
> > > > TLB invalidation.
> > > > 
> > > > Are you seeing a problem in practice here?
> > > Yes, I indeed find a problem in practice.
> > > 
> > > When the migration was cancelled, a TLB conflic abort  was found in guest.
> > > 
> > > This problem is fixed before rework of the page table code, you can have a
> > > look in the following two links:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c3736cd32bf5197aed1410ae826d2d254a5b277
> > > 
> > > https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035031.html
> > Ok, let's go through this, because I still don't see the bug. Please correct
> > me if you spot any mistakes:
> > 
> >    1. We have a block mapping for X => Y
> >    2. Dirty logging is enabled, so the block mapping is write-protected and
> >       ends up being split into page mappings
> >    3. Dirty logging is disabled due to a failed migration.
> > 
> > --- At this point, I think we agree that the state of the MMU is alright ---
> > 
> >    4. We take a stage-2 fault and want to reinstall the block mapping:
> > 
> >       a. kvm_pgtable_stage2_map() is invoked to install the block mapping
> >       b. stage2_map_walk_table_pre() finds a table where we would like to
> >          install the block:
> > 
> > 	i.   The anchor is set to point at this entry
> > 	ii.  The entry is made invalid
> > 	iii. We invalidate the TLB for the input address. This is
> > 	     TLBI IPAS2SE1IS without level hint and then TLBI VMALLE1IS.
> > 
> > 	*** At this point, the page-table pointed to by the old table entry
> > 	    is not reachable to the hardware walker ***
> > 
> >       c. stage2_map_walk_leaf() is called for each leaf entry in the
> >          now-unreachable subtree, dropping page-references for each valid
> > 	entry it finds.
> >       d. stage2_map_walk_table_post() is eventually called for the entry
> >          which we cleared back in b.ii, so we install the new block mapping.
> > 
> > You are proposing to add additional TLB invalidation to (c), but I don't
> > think that is necessary, thanks to the invalidation already performed in
> > b.iii. What am I missing here?
> 
> The point is at b.iii where the TLBI is not enough. There are many page
> mappings that we need to merge into a block mapping.
> 
> We invalidate the TLB for the input address without level hint at b.iii, but
> this operation just flush TLB for one page mapping, there
> 
> are still some TLB entries for the other page mappings in the cache, the MMU
> hardware walker can still hit these entries next time.

Ah, yes, I see. Thanks. I hadn't considered the case where there are table
entries beneath the anchor. So how about the diff below?

Will

--->8

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 0271b4a3b9fe..12526d8c7ae4 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -493,7 +493,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
 		return 0;
 
 	kvm_set_invalid_pte(ptep);
-	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);
+	/* TLB invalidation is deferred until the _post handler */
 	data->anchor = ptep;
 	return 0;
 }
@@ -547,11 +547,21 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
 				      struct stage2_map_data *data)
 {
 	int ret = 0;
+	kvm_pte_t pte = *ptep;
 
 	if (!data->anchor)
 		return 0;
 
-	free_page((unsigned long)kvm_pte_follow(*ptep));
+	kvm_set_invalid_pte(ptep);
+
+	/*
+	 * Invalidate the whole stage-2, as we may have numerous leaf
+	 * entries below us which would otherwise need invalidating
+	 * individually.
+	 */
+	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
+
+	free_page((unsigned long)kvm_pte_follow(pte));
 	put_page(virt_to_page(ptep));
 
 	if (data->anchor == ptep) {
Marc Zyngier Dec. 1, 2020, 2:05 p.m. UTC | #6
On 2020-12-01 13:46, Will Deacon wrote:
> On Tue, Dec 01, 2020 at 10:30:41AM +0800, wangyanan (Y) wrote:
>> On 2020/12/1 0:01, Will Deacon wrote:
>> > On Mon, Nov 30, 2020 at 11:24:19PM +0800, wangyanan (Y) wrote:
>> > > On 2020/11/30 21:34, Will Deacon wrote:
>> > > > On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:
>> > > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>> > > > > index 696b6aa83faf..fec8dc9f2baa 100644
>> > > > > --- a/arch/arm64/kvm/hyp/pgtable.c
>> > > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
>> > > > > @@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>> > > > >    	return 0;
>> > > > >    }
>> > > > > +static void stage2_flush_dcache(void *addr, u64 size);
>> > > > > +static bool stage2_pte_cacheable(kvm_pte_t pte);
>> > > > > +
>> > > > >    static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>> > > > >    				struct stage2_map_data *data)
>> > > > >    {
>> > > > > @@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>> > > > >    	struct page *page = virt_to_page(ptep);
>> > > > >    	if (data->anchor) {
>> > > > > -		if (kvm_pte_valid(pte))
>> > > > > +		if (kvm_pte_valid(pte)) {
>> > > > > +			kvm_set_invalid_pte(ptep);
>> > > > > +			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
>> > > > > +				     addr, level);
>> > > > >    			put_page(page);
>> > > > This doesn't make sense to me: the page-table pages we're walking when the
>> > > > anchor is set are not accessible to the hardware walker because we unhooked
>> > > > the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
>> > > > TLB invalidation.
>> > > >
>> > > > Are you seeing a problem in practice here?
>> > > Yes, I indeed find a problem in practice.
>> > >
>> > > When the migration was cancelled, a TLB conflic abort  was found in guest.
>> > >
>> > > This problem is fixed before rework of the page table code, you can have a
>> > > look in the following two links:
>> > >
>> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c3736cd32bf5197aed1410ae826d2d254a5b277
>> > >
>> > > https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035031.html
>> > Ok, let's go through this, because I still don't see the bug. Please correct
>> > me if you spot any mistakes:
>> >
>> >    1. We have a block mapping for X => Y
>> >    2. Dirty logging is enabled, so the block mapping is write-protected and
>> >       ends up being split into page mappings
>> >    3. Dirty logging is disabled due to a failed migration.
>> >
>> > --- At this point, I think we agree that the state of the MMU is alright ---
>> >
>> >    4. We take a stage-2 fault and want to reinstall the block mapping:
>> >
>> >       a. kvm_pgtable_stage2_map() is invoked to install the block mapping
>> >       b. stage2_map_walk_table_pre() finds a table where we would like to
>> >          install the block:
>> >
>> > 	i.   The anchor is set to point at this entry
>> > 	ii.  The entry is made invalid
>> > 	iii. We invalidate the TLB for the input address. This is
>> > 	     TLBI IPAS2SE1IS without level hint and then TLBI VMALLE1IS.
>> >
>> > 	*** At this point, the page-table pointed to by the old table entry
>> > 	    is not reachable to the hardware walker ***
>> >
>> >       c. stage2_map_walk_leaf() is called for each leaf entry in the
>> >          now-unreachable subtree, dropping page-references for each valid
>> > 	entry it finds.
>> >       d. stage2_map_walk_table_post() is eventually called for the entry
>> >          which we cleared back in b.ii, so we install the new block mapping.
>> >
>> > You are proposing to add additional TLB invalidation to (c), but I don't
>> > think that is necessary, thanks to the invalidation already performed in
>> > b.iii. What am I missing here?
>> 
>> The point is at b.iii where the TLBI is not enough. There are many 
>> page
>> mappings that we need to merge into a block mapping.
>> 
>> We invalidate the TLB for the input address without level hint at 
>> b.iii, but
>> this operation just flush TLB for one page mapping, there
>> 
>> are still some TLB entries for the other page mappings in the cache, 
>> the MMU
>> hardware walker can still hit these entries next time.
> 
> Ah, yes, I see. Thanks. I hadn't considered the case where there are 
> table
> entries beneath the anchor. So how about the diff below?
> 
> Will
> 
> --->8
> 
> diff --git a/arch/arm64/kvm/hyp/pgtable.c 
> b/arch/arm64/kvm/hyp/pgtable.c
> index 0271b4a3b9fe..12526d8c7ae4 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -493,7 +493,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64
> end, u32 level,
>  		return 0;
> 
>  	kvm_set_invalid_pte(ptep);
> -	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);
> +	/* TLB invalidation is deferred until the _post handler */
>  	data->anchor = ptep;
>  	return 0;
>  }
> @@ -547,11 +547,21 @@ static int stage2_map_walk_table_post(u64 addr,
> u64 end, u32 level,
>  				      struct stage2_map_data *data)
>  {
>  	int ret = 0;
> +	kvm_pte_t pte = *ptep;
> 
>  	if (!data->anchor)
>  		return 0;
> 
> -	free_page((unsigned long)kvm_pte_follow(*ptep));
> +	kvm_set_invalid_pte(ptep);
> +
> +	/*
> +	 * Invalidate the whole stage-2, as we may have numerous leaf
> +	 * entries below us which would otherwise need invalidating
> +	 * individually.
> +	 */
> +	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);

That's a big hammer, and we so far have been pretty careful not to
over-invalidate. Is the block-replacing-table *without* an unmap
in between the only case where this triggers?

Thanks,

         M.
Yanan Wang Dec. 1, 2020, 2:11 p.m. UTC | #7
On 2020/12/1 21:46, Will Deacon wrote:
> On Tue, Dec 01, 2020 at 10:30:41AM +0800, wangyanan (Y) wrote:
>> On 2020/12/1 0:01, Will Deacon wrote:
>>> On Mon, Nov 30, 2020 at 11:24:19PM +0800, wangyanan (Y) wrote:
>>>> On 2020/11/30 21:34, Will Deacon wrote:
>>>>> On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:
>>>>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>>>>>> index 696b6aa83faf..fec8dc9f2baa 100644
>>>>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>>>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>>>>> @@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>>>>>>     	return 0;
>>>>>>     }
>>>>>> +static void stage2_flush_dcache(void *addr, u64 size);
>>>>>> +static bool stage2_pte_cacheable(kvm_pte_t pte);
>>>>>> +
>>>>>>     static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>>>>>     				struct stage2_map_data *data)
>>>>>>     {
>>>>>> @@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>>>>>     	struct page *page = virt_to_page(ptep);
>>>>>>     	if (data->anchor) {
>>>>>> -		if (kvm_pte_valid(pte))
>>>>>> +		if (kvm_pte_valid(pte)) {
>>>>>> +			kvm_set_invalid_pte(ptep);
>>>>>> +			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
>>>>>> +				     addr, level);
>>>>>>     			put_page(page);
>>>>> This doesn't make sense to me: the page-table pages we're walking when the
>>>>> anchor is set are not accessible to the hardware walker because we unhooked
>>>>> the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
>>>>> TLB invalidation.
>>>>>
>>>>> Are you seeing a problem in practice here?
>>>> Yes, I indeed find a problem in practice.
>>>>
>>>> When the migration was cancelled, a TLB conflic abort  was found in guest.
>>>>
>>>> This problem is fixed before rework of the page table code, you can have a
>>>> look in the following two links:
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c3736cd32bf5197aed1410ae826d2d254a5b277
>>>>
>>>> https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035031.html
>>> Ok, let's go through this, because I still don't see the bug. Please correct
>>> me if you spot any mistakes:
>>>
>>>     1. We have a block mapping for X => Y
>>>     2. Dirty logging is enabled, so the block mapping is write-protected and
>>>        ends up being split into page mappings
>>>     3. Dirty logging is disabled due to a failed migration.
>>>
>>> --- At this point, I think we agree that the state of the MMU is alright ---
>>>
>>>     4. We take a stage-2 fault and want to reinstall the block mapping:
>>>
>>>        a. kvm_pgtable_stage2_map() is invoked to install the block mapping
>>>        b. stage2_map_walk_table_pre() finds a table where we would like to
>>>           install the block:
>>>
>>> 	i.   The anchor is set to point at this entry
>>> 	ii.  The entry is made invalid
>>> 	iii. We invalidate the TLB for the input address. This is
>>> 	     TLBI IPAS2SE1IS without level hint and then TLBI VMALLE1IS.
>>>
>>> 	*** At this point, the page-table pointed to by the old table entry
>>> 	    is not reachable to the hardware walker ***
>>>
>>>        c. stage2_map_walk_leaf() is called for each leaf entry in the
>>>           now-unreachable subtree, dropping page-references for each valid
>>> 	entry it finds.
>>>        d. stage2_map_walk_table_post() is eventually called for the entry
>>>           which we cleared back in b.ii, so we install the new block mapping.
>>>
>>> You are proposing to add additional TLB invalidation to (c), but I don't
>>> think that is necessary, thanks to the invalidation already performed in
>>> b.iii. What am I missing here?
>> The point is at b.iii where the TLBI is not enough. There are many page
>> mappings that we need to merge into a block mapping.
>>
>> We invalidate the TLB for the input address without level hint at b.iii, but
>> this operation just flush TLB for one page mapping, there
>>
>> are still some TLB entries for the other page mappings in the cache, the MMU
>> hardware walker can still hit these entries next time.
> Ah, yes, I see. Thanks. I hadn't considered the case where there are table
> entries beneath the anchor. So how about the diff below?
>
> Will
>
> --->8

Hi, I think it's inappropriate to put the TLBI for all the leaf entries in
stage2_map_walk_table_post(), because *ptep must be an upper-level table
entry when we enter stage2_map_walk_table_post().

We should perform the TLBI for every leaf entry (not table entry) at the
last lookup level, just as I am proposing with the additional TLBI in
stage2_map_walk_leaf().

Thanks.


Yanan

>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 0271b4a3b9fe..12526d8c7ae4 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -493,7 +493,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>   		return 0;
>   
>   	kvm_set_invalid_pte(ptep);
> -	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);
> +	/* TLB invalidation is deferred until the _post handler */
>   	data->anchor = ptep;
>   	return 0;
>   }
> @@ -547,11 +547,21 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
>   				      struct stage2_map_data *data)
>   {
>   	int ret = 0;
> +	kvm_pte_t pte = *ptep;
>   
>   	if (!data->anchor)
>   		return 0;
>   
> -	free_page((unsigned long)kvm_pte_follow(*ptep));
> +	kvm_set_invalid_pte(ptep);
> +
> +	/*
> +	 * Invalidate the whole stage-2, as we may have numerous leaf
> +	 * entries below us which would otherwise need invalidating
> +	 * individually.
> +	 */
> +	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
> +
> +	free_page((unsigned long)kvm_pte_follow(pte));
>   	put_page(virt_to_page(ptep));
>   
>   	if (data->anchor == ptep) {
> .
Will Deacon Dec. 1, 2020, 2:23 p.m. UTC | #8
On Tue, Dec 01, 2020 at 02:05:03PM +0000, Marc Zyngier wrote:
> On 2020-12-01 13:46, Will Deacon wrote:
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 0271b4a3b9fe..12526d8c7ae4 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -493,7 +493,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64
> > end, u32 level,
> >  		return 0;
> > 
> >  	kvm_set_invalid_pte(ptep);
> > -	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);
> > +	/* TLB invalidation is deferred until the _post handler */
> >  	data->anchor = ptep;
> >  	return 0;
> >  }
> > @@ -547,11 +547,21 @@ static int stage2_map_walk_table_post(u64 addr,
> > u64 end, u32 level,
> >  				      struct stage2_map_data *data)
> >  {
> >  	int ret = 0;
> > +	kvm_pte_t pte = *ptep;
> > 
> >  	if (!data->anchor)
> >  		return 0;
> > 
> > -	free_page((unsigned long)kvm_pte_follow(*ptep));
> > +	kvm_set_invalid_pte(ptep);
> > +
> > +	/*
> > +	 * Invalidate the whole stage-2, as we may have numerous leaf
> > +	 * entries below us which would otherwise need invalidating
> > +	 * individually.
> > +	 */
> > +	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
> 
> That's a big hammer, and we so far have been pretty careful not to
> over-invalidate. Is the block-replacing-table *without* an unmap
> in between the only case where this triggers?

Yes, this only happens in that case. The alternative would be to issue
invalidations for every single entry we unmap, which I can implement if
you prefer, but it felt worse to me (given that by-IPA invalidation
isn't really great either).

Will
Marc Zyngier Dec. 1, 2020, 2:32 p.m. UTC | #9
On 2020-12-01 14:23, Will Deacon wrote:
> On Tue, Dec 01, 2020 at 02:05:03PM +0000, Marc Zyngier wrote:
>> On 2020-12-01 13:46, Will Deacon wrote:
>> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>> > index 0271b4a3b9fe..12526d8c7ae4 100644
>> > --- a/arch/arm64/kvm/hyp/pgtable.c
>> > +++ b/arch/arm64/kvm/hyp/pgtable.c
>> > @@ -493,7 +493,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64
>> > end, u32 level,
>> >  		return 0;
>> >
>> >  	kvm_set_invalid_pte(ptep);
>> > -	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);
>> > +	/* TLB invalidation is deferred until the _post handler */
>> >  	data->anchor = ptep;
>> >  	return 0;
>> >  }
>> > @@ -547,11 +547,21 @@ static int stage2_map_walk_table_post(u64 addr,
>> > u64 end, u32 level,
>> >  				      struct stage2_map_data *data)
>> >  {
>> >  	int ret = 0;
>> > +	kvm_pte_t pte = *ptep;
>> >
>> >  	if (!data->anchor)
>> >  		return 0;
>> >
>> > -	free_page((unsigned long)kvm_pte_follow(*ptep));
>> > +	kvm_set_invalid_pte(ptep);
>> > +
>> > +	/*
>> > +	 * Invalidate the whole stage-2, as we may have numerous leaf
>> > +	 * entries below us which would otherwise need invalidating
>> > +	 * individually.
>> > +	 */
>> > +	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>> 
>> That's a big hammer, and we so far have been pretty careful not to
>> over-invalidate. Is the block-replacing-table *without* an unmap
>> in between the only case where this triggers?
> 
> Yes, this only happens in that case. The alternative would be to issue
> invalidations for every single entry we unmap, which I can implement if
> you prefer, but it felt worse to me given that by-IPA invalidation
> isn't really great either).

Right. If that's the only case where this happens, I'm not too bothered.
What I want to make sure is that the normal table->block upgrade path
(which goes via a MMU notifier that unmaps the table) stays relatively
quick and doesn't suffer from the above.

It looks like Yanan still has some concerns though, and I'd like to
understand what they are.

Thanks,

         M.
Marc Zyngier Dec. 1, 2020, 2:35 p.m. UTC | #10
Hi Yanan,

On 2020-12-01 14:11, wangyanan (Y) wrote:
> On 2020/12/1 21:46, Will Deacon wrote:
>> On Tue, Dec 01, 2020 at 10:30:41AM +0800, wangyanan (Y) wrote:

[...]

>>> The point is at b.iii where the TLBI is not enough. There are many 
>>> page
>>> mappings that we need to merge into a block mapping.
>>> 
>>> We invalidate the TLB for the input address without level hint at 
>>> b.iii, but
>>> this operation just flush TLB for one page mapping, there
>>> 
>>> are still some TLB entries for the other page mappings in the cache, 
>>> the MMU
>>> hardware walker can still hit these entries next time.
>> Ah, yes, I see. Thanks. I hadn't considered the case where there are 
>> table
>> entries beneath the anchor. So how about the diff below?
>> 
>> Will
>> 
>> --->8
> 
> Hi, I think it's inappropriate to put the TLBI of all the leaf entries
> in function stage2_map_walk_table_post(),
> 
> because the *ptep must be an upper table entry when we enter
> stage2_map_walk_table_post().
> 
> We should make the TLBI for every leaf entry not table entry in the
> last lookup level,  just as I am proposing
> 
> to add the additional TLBI in function stage2_map_walk_leaf().

Could you make your concerns explicit? As far as I can tell, this should
address the bug you found, at least from a correctness perspective.

Are you worried about the impact of the full S2 invalidation? Or do you
see another correctness issue?

Thanks,

         M.
Yanan Wang Dec. 1, 2020, 5:20 p.m. UTC | #11
On 2020/12/1 22:35, Marc Zyngier wrote:

> Hi Yanan,
>
> On 2020-12-01 14:11, wangyanan (Y) wrote:
>> On 2020/12/1 21:46, Will Deacon wrote:
>>> On Tue, Dec 01, 2020 at 10:30:41AM +0800, wangyanan (Y) wrote:
>
> [...]
>
>>>> The point is at b.iii where the TLBI is not enough. There are many 
>>>> page
>>>> mappings that we need to merge into a block mapping.
>>>>
>>>> We invalidate the TLB for the input address without level hint at 
>>>> b.iii, but
>>>> this operation just flush TLB for one page mapping, there
>>>>
>>>> are still some TLB entries for the other page mappings in the 
>>>> cache, the MMU
>>>> hardware walker can still hit these entries next time.
>>> Ah, yes, I see. Thanks. I hadn't considered the case where there are 
>>> table
>>> entries beneath the anchor. So how about the diff below?
>>>
>>> Will
>>>
>>> --->8
>>
>> Hi, I think it's inappropriate to put the TLBI of all the leaf entries
>> in function stage2_map_walk_table_post(),
>>
>> because the *ptep must be an upper table entry when we enter
>> stage2_map_walk_table_post().
>>
>> We should make the TLBI for every leaf entry not table entry in the
>> last lookup level,  just as I am proposing
>>
>> to add the additional TLBI in function stage2_map_walk_leaf().
>
> Could you make your concerns explicit? As far as I can tell, this should
> address the bug you found, at least from a correctness perspective.
>
> Are you worried about the impact of the full S2 invalidation? Or do you
> see another correctness issue?


Hi Will, Marc,


After rechecking the diff, the full S2 invalidation in
stage2_map_walk_table_post() should be enough to solve this problem.

But I was wondering whether we could instead add the full S2 invalidation in
stage2_map_walk_table_pre(), where __kvm_tlb_flush_vmid() would be called
only once.

If we add the full TLBI in stage2_map_walk_table_post(),
__kvm_tlb_flush_vmid() might be called many times in the loop and lots of
(unnecessary) CPU instructions would be wasted.

What I'm suggesting is something like the diff below; please let me know
what you think.

If this is OK, I can update the diff in v2 and send it with your SoB (is
that appropriate?) after some tests.


diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index b232bdd142a6..f11fb2996080 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -496,7 +496,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64 
end, u32 level,
                 return 0;

         kvm_set_invalid_pte(ptep);
-       kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);
+       kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
         data->anchor = ptep;
         return 0;
  }


Thanks,

Yanan
Will Deacon Dec. 1, 2020, 6:17 p.m. UTC | #12
On Wed, Dec 02, 2020 at 01:20:33AM +0800, wangyanan (Y) wrote:
> On 2020/12/1 22:35, Marc Zyngier wrote:
> 
> > Hi Yanan,
> > 
> > On 2020-12-01 14:11, wangyanan (Y) wrote:
> > > On 2020/12/1 21:46, Will Deacon wrote:
> > > > On Tue, Dec 01, 2020 at 10:30:41AM +0800, wangyanan (Y) wrote:
> > 
> > [...]
> > 
> > > > > The point is at b.iii where the TLBI is not enough. There
> > > > > are many page
> > > > > mappings that we need to merge into a block mapping.
> > > > > 
> > > > > We invalidate the TLB for the input address without level
> > > > > hint at b.iii, but
> > > > > this operation just flush TLB for one page mapping, there
> > > > > 
> > > > > are still some TLB entries for the other page mappings in
> > > > > the cache, the MMU
> > > > > hardware walker can still hit these entries next time.
> > > > Ah, yes, I see. Thanks. I hadn't considered the case where there
> > > > are table
> > > > entries beneath the anchor. So how about the diff below?
> > > > 
> > > > Will
> > > > 
> > > > --->8
> > > 
> > > Hi, I think it's inappropriate to put the TLBI of all the leaf entries
> > > in function stage2_map_walk_table_post(),
> > > 
> > > because the *ptep must be an upper table entry when we enter
> > > stage2_map_walk_table_post().
> > > 
> > > We should make the TLBI for every leaf entry not table entry in the
> > > last lookup level,  just as I am proposing
> > > 
> > > to add the additional TLBI in function stage2_map_walk_leaf().
> > 
> > Could you make your concerns explicit? As far as I can tell, this should
> > address the bug you found, at least from a correctness perspective.
> > 
> > Are you worried about the impact of the full S2 invalidation? Or do you
> > see another correctness issue?
> 
> 
> Hi Will, Marc,
> 
> 
> After recheck of the diff, the full S2 invalidation in
> stage2_map_walk_table_post() should be well enough to solve this problem.
> 
> But I was wondering if we can add the full S2 invalidation in
> stage2_map_walk_table_pre(), where __kvm_tlb_flush_vmid() will be called for
> only one time.
> 
> If we add the full TLBI in stage2_map_walk_table_post(),
> __kvm_tlb_flush_vmid() might be called for many times in the loop and lots
> of (unnecessary) CPU instructions will be wasted.
> 
> What I'm saying is something like below, please let me know what do you
> think.
> 
> If this is OK, I can update the diff in v2 and send it with your SOB (is it
> appropriate?) after some tests.
> 
> 
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index b232bdd142a6..f11fb2996080 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -496,7 +496,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end,
> u32 level,
>                 return 0;
> 
>         kvm_set_invalid_pte(ptep);
> -       kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);
> +       kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>         data->anchor = ptep;
>         return 0;

Yes, I think that's much better, but please add a comment! (you can
probably more-or-less copy the one I had in the post handler)

Will
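
For illustration, combining Yanan's proposed diff with the comment Will asked
to be carried over from the _post handler, the pre handler in a hypothetical
v2 might end up looking roughly like this (a sketch only, not the final patch):

static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
				     kvm_pte_t *ptep,
				     struct stage2_map_data *data)
{
	/* ... existing checks elided ... */

	kvm_set_invalid_pte(ptep);

	/*
	 * Invalidate the whole stage-2, as we may have numerous leaf
	 * entries below us which would otherwise need invalidating
	 * individually.
	 */
	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);

	data->anchor = ptep;
	return 0;
}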
diff mbox series

Patch

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 696b6aa83faf..fec8dc9f2baa 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -500,6 +500,9 @@  static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
 	return 0;
 }
 
+static void stage2_flush_dcache(void *addr, u64 size);
+static bool stage2_pte_cacheable(kvm_pte_t pte);
+
 static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 				struct stage2_map_data *data)
 {
@@ -507,9 +510,17 @@  static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 	struct page *page = virt_to_page(ptep);
 
 	if (data->anchor) {
-		if (kvm_pte_valid(pte))
+		if (kvm_pte_valid(pte)) {
+			kvm_set_invalid_pte(ptep);
+			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
+				     addr, level);
 			put_page(page);
 
+			if (stage2_pte_cacheable(pte))
+				stage2_flush_dcache(kvm_pte_follow(pte),
+						    kvm_granule_size(level));
+		}
+
 		return 0;
 	}
 
@@ -574,7 +585,7 @@  static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
  * The behaviour of the LEAF callback then depends on whether or not the
  * anchor has been set. If not, then we're not using a block mapping higher
  * up the table and we perform the mapping at the existing leaves instead.
- * If, on the other hand, the anchor _is_ set, then we drop references to
+ * If, on the other hand, the anchor _is_ set, then we unmap the mapping of
  * all valid leaves so that the pages beneath the anchor can be freed.
  *
  * Finally, the TABLE_POST callback does nothing if the anchor has not