diff mbox series

[v4,1/2] kvm/arm64: Remove the creation time's mapping of MMIO regions

Message ID 20210415140328.24200-2-zhukeqian1@huawei.com (mailing list archive)
State New
Headers show
Series kvm/arm64: Try stage2 block mapping for host device MMIO | expand

Commit Message

Keqian Zhu April 15, 2021, 2:03 p.m. UTC
The MMIO regions may be unmapped for many reasons and can be remapped
by stage2 fault path. Map MMIO regions at creation time becomes a
minor optimization and makes these two mapping path hard to sync.

Remove the mapping code while keep the useful sanity check.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
 arch/arm64/kvm/mmu.c | 38 +++-----------------------------------
 1 file changed, 3 insertions(+), 35 deletions(-)

Comments

Keqian Zhu April 21, 2021, 6:28 a.m. UTC | #1
Hi Gavin,

On 2021/4/21 14:38, Gavin Shan wrote:
> Hi Keqian,
> 
> On 4/16/21 12:03 AM, Keqian Zhu wrote:
>> The MMIO regions may be unmapped for many reasons and can be remapped
>> by stage2 fault path. Map MMIO regions at creation time becomes a
>> minor optimization and makes these two mapping path hard to sync.
>>
>> Remove the mapping code while keep the useful sanity check.
>>
>> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
>> ---
>>   arch/arm64/kvm/mmu.c | 38 +++-----------------------------------
>>   1 file changed, 3 insertions(+), 35 deletions(-)
>>
> 
> After removing the logic to create stage2 mapping for VM_PFNMAP region,
> I think the "do { } while" loop becomes unnecessary and can be dropped
> completely. It means the only sanity check is to see if the memory slot
> overflows IPA space or not. In that case, KVM_MR_FLAGS_ONLY can be
> ignored because the memory slot's base address and length aren't changed
> when we have KVM_MR_FLAGS_ONLY.
Maybe not exactly. Here we do an important sanity check that we shouldn't
log dirty for memslots with VM_PFNMAP.


> 
> It seems the patch isn't based on "next" branch because find_vma() was
> replaced by find_vma_intersection() by one of my patches :)
Yep, I remember it. I will replace it at next merge window...

Thanks,
Keqian

> 
> Thanks,
> Gavin
> 
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 8711894db8c2..c59af5ca01b0 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -1301,7 +1301,6 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>>   {
>>       hva_t hva = mem->userspace_addr;
>>       hva_t reg_end = hva + mem->memory_size;
>> -    bool writable = !(mem->flags & KVM_MEM_READONLY);
>>       int ret = 0;
>>         if (change != KVM_MR_CREATE && change != KVM_MR_MOVE &&
>> @@ -1318,8 +1317,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>>       mmap_read_lock(current->mm);
>>       /*
>>        * A memory region could potentially cover multiple VMAs, and any holes
>> -     * between them, so iterate over all of them to find out if we can map
>> -     * any of them right now.
>> +     * between them, so iterate over all of them.
>>        *
>>        *     +--------------------------------------------+
>>        * +---------------+----------------+   +----------------+
>> @@ -1330,50 +1328,20 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>>        */
>>       do {
>>           struct vm_area_struct *vma = find_vma(current->mm, hva);
>> -        hva_t vm_start, vm_end;
>>             if (!vma || vma->vm_start >= reg_end)
>>               break;
>>   -        /*
>> -         * Take the intersection of this VMA with the memory region
>> -         */
>> -        vm_start = max(hva, vma->vm_start);
>> -        vm_end = min(reg_end, vma->vm_end);
>> -
>>           if (vma->vm_flags & VM_PFNMAP) {
>> -            gpa_t gpa = mem->guest_phys_addr +
>> -                    (vm_start - mem->userspace_addr);
>> -            phys_addr_t pa;
>> -
>> -            pa = (phys_addr_t)vma->vm_pgoff << PAGE_SHIFT;
>> -            pa += vm_start - vma->vm_start;
>> -
>>               /* IO region dirty page logging not allowed */
>>               if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES) {
>>                   ret = -EINVAL;
>> -                goto out;
>> -            }
>> -
>> -            ret = kvm_phys_addr_ioremap(kvm, gpa, pa,
>> -                            vm_end - vm_start,
>> -                            writable);
>> -            if (ret)
>>                   break;
>> +            }
>>           }
>> -        hva = vm_end;
>> +        hva = min(reg_end, vma->vm_end);
>>       } while (hva < reg_end);
>>   -    if (change == KVM_MR_FLAGS_ONLY)
>> -        goto out;
>> -
>> -    spin_lock(&kvm->mmu_lock);
>> -    if (ret)
>> -        unmap_stage2_range(&kvm->arch.mmu, mem->guest_phys_addr, mem->memory_size);
>> -    else if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
>> -        stage2_flush_memslot(kvm, memslot);
>> -    spin_unlock(&kvm->mmu_lock);
>> -out:
>>       mmap_read_unlock(current->mm);
>>       return ret;
>>   }
>>
> 
> .
>
Gavin Shan April 21, 2021, 6:38 a.m. UTC | #2
Hi Keqian,

On 4/16/21 12:03 AM, Keqian Zhu wrote:
> The MMIO regions may be unmapped for many reasons and can be remapped
> by stage2 fault path. Map MMIO regions at creation time becomes a
> minor optimization and makes these two mapping path hard to sync.
> 
> Remove the mapping code while keep the useful sanity check.
> 
> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> ---
>   arch/arm64/kvm/mmu.c | 38 +++-----------------------------------
>   1 file changed, 3 insertions(+), 35 deletions(-)
> 

After removing the logic to create stage2 mapping for VM_PFNMAP region,
I think the "do { } while" loop becomes unnecessary and can be dropped
completely. It means the only sanity check is to see if the memory slot
overflows IPA space or not. In that case, KVM_MR_FLAGS_ONLY can be
ignored because the memory slot's base address and length aren't changed
when we have KVM_MR_FLAGS_ONLY.

It seems the patch isn't based on "next" branch because find_vma() was
replaced by find_vma_intersection() by one of my patches :)

Thanks,
Gavin

> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 8711894db8c2..c59af5ca01b0 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1301,7 +1301,6 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>   {
>   	hva_t hva = mem->userspace_addr;
>   	hva_t reg_end = hva + mem->memory_size;
> -	bool writable = !(mem->flags & KVM_MEM_READONLY);
>   	int ret = 0;
>   
>   	if (change != KVM_MR_CREATE && change != KVM_MR_MOVE &&
> @@ -1318,8 +1317,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>   	mmap_read_lock(current->mm);
>   	/*
>   	 * A memory region could potentially cover multiple VMAs, and any holes
> -	 * between them, so iterate over all of them to find out if we can map
> -	 * any of them right now.
> +	 * between them, so iterate over all of them.
>   	 *
>   	 *     +--------------------------------------------+
>   	 * +---------------+----------------+   +----------------+
> @@ -1330,50 +1328,20 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>   	 */
>   	do {
>   		struct vm_area_struct *vma = find_vma(current->mm, hva);
> -		hva_t vm_start, vm_end;
>   
>   		if (!vma || vma->vm_start >= reg_end)
>   			break;
>   
> -		/*
> -		 * Take the intersection of this VMA with the memory region
> -		 */
> -		vm_start = max(hva, vma->vm_start);
> -		vm_end = min(reg_end, vma->vm_end);
> -
>   		if (vma->vm_flags & VM_PFNMAP) {
> -			gpa_t gpa = mem->guest_phys_addr +
> -				    (vm_start - mem->userspace_addr);
> -			phys_addr_t pa;
> -
> -			pa = (phys_addr_t)vma->vm_pgoff << PAGE_SHIFT;
> -			pa += vm_start - vma->vm_start;
> -
>   			/* IO region dirty page logging not allowed */
>   			if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES) {
>   				ret = -EINVAL;
> -				goto out;
> -			}
> -
> -			ret = kvm_phys_addr_ioremap(kvm, gpa, pa,
> -						    vm_end - vm_start,
> -						    writable);
> -			if (ret)
>   				break;
> +			}
>   		}
> -		hva = vm_end;
> +		hva = min(reg_end, vma->vm_end);
>   	} while (hva < reg_end);
>   
> -	if (change == KVM_MR_FLAGS_ONLY)
> -		goto out;
> -
> -	spin_lock(&kvm->mmu_lock);
> -	if (ret)
> -		unmap_stage2_range(&kvm->arch.mmu, mem->guest_phys_addr, mem->memory_size);
> -	else if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
> -		stage2_flush_memslot(kvm, memslot);
> -	spin_unlock(&kvm->mmu_lock);
> -out:
>   	mmap_read_unlock(current->mm);
>   	return ret;
>   }
>
Gavin Shan April 22, 2021, 2:12 a.m. UTC | #3
Hi Keqian,

On 4/21/21 4:28 PM, Keqian Zhu wrote:
> On 2021/4/21 14:38, Gavin Shan wrote:
>> On 4/16/21 12:03 AM, Keqian Zhu wrote:
>>> The MMIO regions may be unmapped for many reasons and can be remapped
>>> by stage2 fault path. Map MMIO regions at creation time becomes a
>>> minor optimization and makes these two mapping path hard to sync.
>>>
>>> Remove the mapping code while keep the useful sanity check.
>>>
>>> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
>>> ---
>>>    arch/arm64/kvm/mmu.c | 38 +++-----------------------------------
>>>    1 file changed, 3 insertions(+), 35 deletions(-)
>>>
>>
>> After removing the logic to create stage2 mapping for VM_PFNMAP region,
>> I think the "do { } while" loop becomes unnecessary and can be dropped
>> completely. It means the only sanity check is to see if the memory slot
>> overflows IPA space or not. In that case, KVM_MR_FLAGS_ONLY can be
>> ignored because the memory slot's base address and length aren't changed
>> when we have KVM_MR_FLAGS_ONLY.
> Maybe not exactly. Here we do an important sanity check that we shouldn't
> log dirty for memslots with VM_PFNMAP.
> 

Yeah, Sorry that I missed that part. Something associated with Santosh's
patch. The flag can be not existing until the page fault happened on
the vma. In this case, the check could be not working properly.

   [PATCH] KVM: arm64: Correctly handle the mmio faulting

[...]

Thanks,
Gavin

>>
>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>> index 8711894db8c2..c59af5ca01b0 100644
>>> --- a/arch/arm64/kvm/mmu.c
>>> +++ b/arch/arm64/kvm/mmu.c
>>> @@ -1301,7 +1301,6 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>>>    {
>>>        hva_t hva = mem->userspace_addr;
>>>        hva_t reg_end = hva + mem->memory_size;
>>> -    bool writable = !(mem->flags & KVM_MEM_READONLY);
>>>        int ret = 0;
>>>          if (change != KVM_MR_CREATE && change != KVM_MR_MOVE &&
>>> @@ -1318,8 +1317,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>>>        mmap_read_lock(current->mm);
>>>        /*
>>>         * A memory region could potentially cover multiple VMAs, and any holes
>>> -     * between them, so iterate over all of them to find out if we can map
>>> -     * any of them right now.
>>> +     * between them, so iterate over all of them.
>>>         *
>>>         *     +--------------------------------------------+
>>>         * +---------------+----------------+   +----------------+
>>> @@ -1330,50 +1328,20 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>>>         */
>>>        do {
>>>            struct vm_area_struct *vma = find_vma(current->mm, hva);
>>> -        hva_t vm_start, vm_end;
>>>              if (!vma || vma->vm_start >= reg_end)
>>>                break;
>>>    -        /*
>>> -         * Take the intersection of this VMA with the memory region
>>> -         */
>>> -        vm_start = max(hva, vma->vm_start);
>>> -        vm_end = min(reg_end, vma->vm_end);
>>> -
>>>            if (vma->vm_flags & VM_PFNMAP) {
>>> -            gpa_t gpa = mem->guest_phys_addr +
>>> -                    (vm_start - mem->userspace_addr);
>>> -            phys_addr_t pa;
>>> -
>>> -            pa = (phys_addr_t)vma->vm_pgoff << PAGE_SHIFT;
>>> -            pa += vm_start - vma->vm_start;
>>> -
>>>                /* IO region dirty page logging not allowed */
>>>                if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES) {
>>>                    ret = -EINVAL;
>>> -                goto out;
>>> -            }
>>> -
>>> -            ret = kvm_phys_addr_ioremap(kvm, gpa, pa,
>>> -                            vm_end - vm_start,
>>> -                            writable);
>>> -            if (ret)
>>>                    break;
>>> +            }
>>>            }
>>> -        hva = vm_end;
>>> +        hva = min(reg_end, vma->vm_end);
>>>        } while (hva < reg_end);
>>>    -    if (change == KVM_MR_FLAGS_ONLY)
>>> -        goto out;
>>> -
>>> -    spin_lock(&kvm->mmu_lock);
>>> -    if (ret)
>>> -        unmap_stage2_range(&kvm->arch.mmu, mem->guest_phys_addr, mem->memory_size);
>>> -    else if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
>>> -        stage2_flush_memslot(kvm, memslot);
>>> -    spin_unlock(&kvm->mmu_lock);
>>> -out:
>>>        mmap_read_unlock(current->mm);
>>>        return ret;
>>>    }
>>>
>>
>> .
>>
>
Keqian Zhu April 22, 2021, 7:41 a.m. UTC | #4
Hi Gavin,

On 2021/4/22 10:12, Gavin Shan wrote:
> Hi Keqian,
> 
> On 4/21/21 4:28 PM, Keqian Zhu wrote:
>> On 2021/4/21 14:38, Gavin Shan wrote:
>>> On 4/16/21 12:03 AM, Keqian Zhu wrote:
>>>> The MMIO regions may be unmapped for many reasons and can be remapped
>>>> by stage2 fault path. Map MMIO regions at creation time becomes a
>>>> minor optimization and makes these two mapping path hard to sync.
>>>>
>>>> Remove the mapping code while keep the useful sanity check.
>>>>
>>>> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
>>>> ---
>>>>    arch/arm64/kvm/mmu.c | 38 +++-----------------------------------
>>>>    1 file changed, 3 insertions(+), 35 deletions(-)
>>>>
>>>
>>> After removing the logic to create stage2 mapping for VM_PFNMAP region,
>>> I think the "do { } while" loop becomes unnecessary and can be dropped
>>> completely. It means the only sanity check is to see if the memory slot
>>> overflows IPA space or not. In that case, KVM_MR_FLAGS_ONLY can be
>>> ignored because the memory slot's base address and length aren't changed
>>> when we have KVM_MR_FLAGS_ONLY.
>> Maybe not exactly. Here we do an important sanity check that we shouldn't
>> log dirty for memslots with VM_PFNMAP.
>>
> 
> Yeah, Sorry that I missed that part. Something associated with Santosh's
> patch. The flag can be not existing until the page fault happened on
> the vma. In this case, the check could be not working properly.
> 
>   [PATCH] KVM: arm64: Correctly handle the mmio faulting
Yeah, you are right.

If that happens, we won't try to use block mapping for memslot with VM_PFNMAP.
But it keeps a same logic with old code.

1. When without dirty-logging, we won't try block mapping for it, and we'll
finally know that it's device, so won't try to do adjust THP (Transparent Huge Page)
for it.
2. If userspace wrongly enables dirty logging for this memslot, we'll force_pte for it.

Thanks,
Keqian
Gavin Shan April 23, 2021, 12:36 a.m. UTC | #5
On 4/16/21 12:03 AM, Keqian Zhu wrote:
> The MMIO regions may be unmapped for many reasons and can be remapped
> by stage2 fault path. Map MMIO regions at creation time becomes a
> minor optimization and makes these two mapping path hard to sync.
> 
> Remove the mapping code while keep the useful sanity check.
> 
> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> ---
>   arch/arm64/kvm/mmu.c | 38 +++-----------------------------------
>   1 file changed, 3 insertions(+), 35 deletions(-)
>

Reviewed-by: Gavin Shan <gshan@redhat.com>

  
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 8711894db8c2..c59af5ca01b0 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1301,7 +1301,6 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>   {
>   	hva_t hva = mem->userspace_addr;
>   	hva_t reg_end = hva + mem->memory_size;
> -	bool writable = !(mem->flags & KVM_MEM_READONLY);
>   	int ret = 0;
>   
>   	if (change != KVM_MR_CREATE && change != KVM_MR_MOVE &&
> @@ -1318,8 +1317,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>   	mmap_read_lock(current->mm);
>   	/*
>   	 * A memory region could potentially cover multiple VMAs, and any holes
> -	 * between them, so iterate over all of them to find out if we can map
> -	 * any of them right now.
> +	 * between them, so iterate over all of them.
>   	 *
>   	 *     +--------------------------------------------+
>   	 * +---------------+----------------+   +----------------+
> @@ -1330,50 +1328,20 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>   	 */
>   	do {
>   		struct vm_area_struct *vma = find_vma(current->mm, hva);
> -		hva_t vm_start, vm_end;
>   
>   		if (!vma || vma->vm_start >= reg_end)
>   			break;
>   
> -		/*
> -		 * Take the intersection of this VMA with the memory region
> -		 */
> -		vm_start = max(hva, vma->vm_start);
> -		vm_end = min(reg_end, vma->vm_end);
> -
>   		if (vma->vm_flags & VM_PFNMAP) {
> -			gpa_t gpa = mem->guest_phys_addr +
> -				    (vm_start - mem->userspace_addr);
> -			phys_addr_t pa;
> -
> -			pa = (phys_addr_t)vma->vm_pgoff << PAGE_SHIFT;
> -			pa += vm_start - vma->vm_start;
> -
>   			/* IO region dirty page logging not allowed */
>   			if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES) {
>   				ret = -EINVAL;
> -				goto out;
> -			}
> -
> -			ret = kvm_phys_addr_ioremap(kvm, gpa, pa,
> -						    vm_end - vm_start,
> -						    writable);
> -			if (ret)
>   				break;
> +			}
>   		}
> -		hva = vm_end;
> +		hva = min(reg_end, vma->vm_end);
>   	} while (hva < reg_end);
>   
> -	if (change == KVM_MR_FLAGS_ONLY)
> -		goto out;
> -
> -	spin_lock(&kvm->mmu_lock);
> -	if (ret)
> -		unmap_stage2_range(&kvm->arch.mmu, mem->guest_phys_addr, mem->memory_size);
> -	else if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
> -		stage2_flush_memslot(kvm, memslot);
> -	spin_unlock(&kvm->mmu_lock);
> -out:
>   	mmap_read_unlock(current->mm);
>   	return ret;
>   }
>
Gavin Shan April 23, 2021, 1:35 a.m. UTC | #6
Hi Keqian,

On 4/22/21 5:41 PM, Keqian Zhu wrote:
> On 2021/4/22 10:12, Gavin Shan wrote:
>> On 4/21/21 4:28 PM, Keqian Zhu wrote:
>>> On 2021/4/21 14:38, Gavin Shan wrote:
>>>> On 4/16/21 12:03 AM, Keqian Zhu wrote:

[...]

>>
>> Yeah, Sorry that I missed that part. Something associated with Santosh's
>> patch. The flag can be not existing until the page fault happened on
>> the vma. In this case, the check could be not working properly.
>>
>>    [PATCH] KVM: arm64: Correctly handle the mmio faulting
> Yeah, you are right.
> 
> If that happens, we won't try to use block mapping for memslot with VM_PFNMAP.
> But it keeps a same logic with old code.
> 
> 1. When without dirty-logging, we won't try block mapping for it, and we'll
> finally know that it's device, so won't try to do adjust THP (Transparent Huge Page)
> for it.
> 2. If userspace wrongly enables dirty logging for this memslot, we'll force_pte for it.
> 

It's not about the patch itself and just want more discussion to get more details.
The patch itself looks good to me. I got two questions as below:

(1) The memslot fails to be added if it's backed by MMIO region and dirty logging is
enabled in kvm_arch_prepare_memory_region(). As Santosh reported, the corresponding
vma could be associated with MMIO region and VM_PFNMAP is missed. In this case,
kvm_arch_prepare_memory_region() isn't returning error, meaning the memslot can be
added successfully and block mapping isn't used, as you mentioned. The question is
the memslot is added, but the expected result would be failure.

(2) If dirty logging is enabled on the MMIO memslot, everything should be fine. If
the dirty logging isn't enabled and VM_PFNMAP isn't set yet in user_mem_abort(),
block mapping won't be used and PAGE_SIZE is picked, but the failing IPA might
be good candidate for block mapping. It means we miss something for blocking
mapping?

By the way, do you have idea why dirty logging can't be enabled on MMIO memslot?
I guess Marc might know the history. For example, QEMU is taking "/dev/mem" or
"/dev/kmem" to back guest's memory, the vma is marked as MMIO, but dirty logging
and migration isn't supported?

Thanks,
Gavin
Keqian Zhu April 23, 2021, 1:36 a.m. UTC | #7
Hi Gavin,

On 2021/4/23 9:35, Gavin Shan wrote:
> Hi Keqian,
> 
> On 4/22/21 5:41 PM, Keqian Zhu wrote:
>> On 2021/4/22 10:12, Gavin Shan wrote:
>>> On 4/21/21 4:28 PM, Keqian Zhu wrote:
>>>> On 2021/4/21 14:38, Gavin Shan wrote:
>>>>> On 4/16/21 12:03 AM, Keqian Zhu wrote:
> 
> [...]
> 
>>>
>>> Yeah, Sorry that I missed that part. Something associated with Santosh's
>>> patch. The flag can be not existing until the page fault happened on
>>> the vma. In this case, the check could be not working properly.
>>>
>>>    [PATCH] KVM: arm64: Correctly handle the mmio faulting
>> Yeah, you are right.
>>
>> If that happens, we won't try to use block mapping for memslot with VM_PFNMAP.
>> But it keeps a same logic with old code.
>>
>> 1. When without dirty-logging, we won't try block mapping for it, and we'll
>> finally know that it's device, so won't try to do adjust THP (Transparent Huge Page)
>> for it.
>> 2. If userspace wrongly enables dirty logging for this memslot, we'll force_pte for it.
>>
> 
> It's not about the patch itself and just want more discussion to get more details.
> The patch itself looks good to me. I got two questions as below:
> 
> (1) The memslot fails to be added if it's backed by MMIO region and dirty logging is
> enabled in kvm_arch_prepare_memory_region(). As Santosh reported, the corresponding
> vma could be associated with MMIO region and VM_PFNMAP is missed. In this case,
> kvm_arch_prepare_memory_region() isn't returning error, meaning the memslot can be
> added successfully and block mapping isn't used, as you mentioned. The question is
> the memslot is added, but the expected result would be failure.
Sure. I think we could try to populate the final flag of vma in kvm_arch_prepare_memory_region().
Maybe through GUP or any better method? It's nice if you can try to solve this. :)

> 
> (2) If dirty logging is enabled on the MMIO memslot, everything should be fine. If
> the dirty logging isn't enabled and VM_PFNMAP isn't set yet in user_mem_abort(),
> block mapping won't be used and PAGE_SIZE is picked, but the failing IPA might
> be good candidate for block mapping. It means we miss something for blocking
> mapping?
Right. This issue also can be solved by populating the final flag of vma in kvm_arch_prepare_memory_region().


> 
> By the way, do you have idea why dirty logging can't be enabled on MMIO memslot?
IIUC, MMIO region is of device memory type, it's associated with device state and action.
For normal memory type, we can write it out-of-order and repeatedly, but for device memory
type, we can't do that. The write to MMIO will trigger device action based on current device
state, also what we can read from MMIO based on current device state. Thus the policy of
dirty logging for normal memory can't be applied to MMIO.



> I guess Marc might know the history. For example, QEMU is taking "/dev/mem" or
> "/dev/kmem" to back guest's memory, the vma is marked as MMIO, but dirty logging
> and migration isn't supported?
The MMIO region is a part of device state. We need extra kernel driver to support migration
of pass-through device, as how to save and restore the device state is closely related to
a specific type of device. You can refer VFIO migration for more detail.

Thanks,
Keqian
diff mbox series

Patch

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8711894db8c2..c59af5ca01b0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1301,7 +1301,6 @@  int kvm_arch_prepare_memory_region(struct kvm *kvm,
 {
 	hva_t hva = mem->userspace_addr;
 	hva_t reg_end = hva + mem->memory_size;
-	bool writable = !(mem->flags & KVM_MEM_READONLY);
 	int ret = 0;
 
 	if (change != KVM_MR_CREATE && change != KVM_MR_MOVE &&
@@ -1318,8 +1317,7 @@  int kvm_arch_prepare_memory_region(struct kvm *kvm,
 	mmap_read_lock(current->mm);
 	/*
 	 * A memory region could potentially cover multiple VMAs, and any holes
-	 * between them, so iterate over all of them to find out if we can map
-	 * any of them right now.
+	 * between them, so iterate over all of them.
 	 *
 	 *     +--------------------------------------------+
 	 * +---------------+----------------+   +----------------+
@@ -1330,50 +1328,20 @@  int kvm_arch_prepare_memory_region(struct kvm *kvm,
 	 */
 	do {
 		struct vm_area_struct *vma = find_vma(current->mm, hva);
-		hva_t vm_start, vm_end;
 
 		if (!vma || vma->vm_start >= reg_end)
 			break;
 
-		/*
-		 * Take the intersection of this VMA with the memory region
-		 */
-		vm_start = max(hva, vma->vm_start);
-		vm_end = min(reg_end, vma->vm_end);
-
 		if (vma->vm_flags & VM_PFNMAP) {
-			gpa_t gpa = mem->guest_phys_addr +
-				    (vm_start - mem->userspace_addr);
-			phys_addr_t pa;
-
-			pa = (phys_addr_t)vma->vm_pgoff << PAGE_SHIFT;
-			pa += vm_start - vma->vm_start;
-
 			/* IO region dirty page logging not allowed */
 			if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES) {
 				ret = -EINVAL;
-				goto out;
-			}
-
-			ret = kvm_phys_addr_ioremap(kvm, gpa, pa,
-						    vm_end - vm_start,
-						    writable);
-			if (ret)
 				break;
+			}
 		}
-		hva = vm_end;
+		hva = min(reg_end, vma->vm_end);
 	} while (hva < reg_end);
 
-	if (change == KVM_MR_FLAGS_ONLY)
-		goto out;
-
-	spin_lock(&kvm->mmu_lock);
-	if (ret)
-		unmap_stage2_range(&kvm->arch.mmu, mem->guest_phys_addr, mem->memory_size);
-	else if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
-		stage2_flush_memslot(kvm, memslot);
-	spin_unlock(&kvm->mmu_lock);
-out:
 	mmap_read_unlock(current->mm);
 	return ret;
 }