Message ID: 1603297010-18787-1-git-send-email-sashukla@nvidia.com (mailing list archive)
State:      New, archived
Series:     KVM: arm64: Correctly handle the mmio faulting
Hi Santosh,

Thanks for this.

On 2020-10-21 17:16, Santosh Shukla wrote:
> Commit 6d674e28 introduces a notion to detect and handle device
> mappings. The commit checks whether the VM_PFNMAP flag is set in
> vma->flags and, if so, marks force_pte as true, so that the THP check
> (transparent_hugepage_adjust()) is skipped.
>
> There could be an issue with how the VM_PFNMAP flag is set and checked.
> For example, consider a case where the mdev vendor driver registers a
> vma_fault handler named vma_mmio_fault(), which maps the host MMIO
> region by calling remap_pfn_range() on the MMIO's vma space.
> remap_pfn_range() implicitly sets the VM_PFNMAP flag in vma->flags.
>
> Now assume an mmio fault handling flow where the guest first accesses
> the MMIO region whose stage-2 translation is not yet present. That
> results in the arm64 KVM hypervisor executing the guest abort handler,
> like below:
>
> kvm_handle_guest_abort() -->
>   user_mem_abort() --> {
>     ...
>     0. Checks vma->flags for VM_PFNMAP.
>     1. Since VM_PFNMAP is not yet set, force_pte _is_ false;
>     2. gfn_to_pfn_prot() -->
>          __gfn_to_pfn_memslot() -->
>            fixup_user_fault() -->
>              handle_mm_fault() -->
>                __do_fault() -->
>                  vma_mmio_fault() -->  // vendor's mdev fault handler
>                    remap_pfn_range()   // Here VM_PFNMAP gets set
>                                        // in vma->flags.
>     3. Since force_pte was left false in step 1, we go on to
>        execute transparent_hugepage_adjust(), and that leads
>        to the Oops [4].
>   }

Hmmm. Nice. Any chance you could provide us with an actual reproducer?

> The proposition is to check the is_iomap flag before executing the THP
> function transparent_hugepage_adjust().
>
> [4] THP Oops:
>> pc: kvm_is_transparent_hugepage+0x18/0xb0
>> ...
>> ...
>> user_mem_abort+0x340/0x9b8
>> kvm_handle_guest_abort+0x248/0x468
>> handle_exit+0x150/0x1b0
>> kvm_arch_vcpu_ioctl_run+0x4d4/0x778
>> kvm_vcpu_ioctl+0x3c0/0x858
>> ksys_ioctl+0x84/0xb8
>> __arm64_sys_ioctl+0x28/0x38
>
> Tested on a Huawei Kunpeng TaiShan-200 arm64 server, using a VFIO-mdev
> device.
> Linux tip: 583090b1
>
> Fixes: 6d674e28 ("KVM: arm/arm64: Properly handle faulting of device
> mappings")
> Signed-off-by: Santosh Shukla <sashukla@nvidia.com>
> ---
>  arch/arm64/kvm/mmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 3d26b47..ff15357 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1947,7 +1947,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	 * If we are not forced to use page mapping, check if we are
>  	 * backed by a THP and thus use block mapping if possible.
>  	 */
> -	if (vma_pagesize == PAGE_SIZE && !force_pte)
> +	if (vma_pagesize == PAGE_SIZE && !force_pte && !is_iomap(flags))
>  		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
>  							   &pfn, &fault_ipa);
>  	if (writable)

Why don't you directly set force_pte to true at the point where we
update the flags? It certainly would be a bit more readable:

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 3d26b47a1343..7a4ad984d54e 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1920,6 +1920,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (kvm_is_device_pfn(pfn)) {
 		mem_type = PAGE_S2_DEVICE;
 		flags |= KVM_S2PTE_FLAG_IS_IOMAP;
+		force_pte = true;
 	} else if (logging_active) {
 		/*
 		 * Faults on pages in a memslot with logging enabled

and almost directly applies to what we have queued for 5.10.

Thanks,

        M.
Hi Santosh,

On 2020/10/22 0:16, Santosh Shukla wrote:
> The Commit:6d674e28 introduces a notion to detect and handle the
> device mapping. The commit checks for the VM_PFNMAP flag is set
> in vma->flags and if set then marks force_pte to true such that
> if force_pte is true then ignore the THP function check
> (/transparent_hugepage_adjust()).
>
> There could be an issue with the VM_PFNMAP flag setting and checking.
> For example consider a case where the mdev vendor driver register's
> the vma_fault handler named vma_mmio_fault(), which maps the
> host MMIO region in-turn calls remap_pfn_range() and maps
> the MMIO's vma space. Where, remap_pfn_range implicitly sets
> the VM_PFNMAP flag into vma->flags.

Could you give the name of the mdev vendor driver that triggers this issue?
I failed to find one according to your description. Thanks.

BRs,
Keqian

[...]
Hi Gavin,

On 2021/4/21 14:20, Gavin Shan wrote:
> Hi Keqian and Santosh,
>
> On 4/21/21 12:59 PM, Keqian Zhu wrote:
>> On 2020/10/22 0:16, Santosh Shukla wrote:
[...]
>> Could you give the name of the mdev vendor driver that triggers this issue?
>> I failed to find one according to your description. Thanks.
>>
>
> I think it would be fixed in driver side to set VM_PFNMAP in
> its mmap() callback (call_mmap()), like vfio PCI driver does.
> It means it won't be delayed until page fault is issued and
> remap_pfn_range() is called. It's determined from the beginning
> that the vma associated the mdev vendor driver is serving as
> PFN remapping purpose. So the vma should be populated completely,
> including the VM_PFNMAP flag before it becomes visible to user
> space.
>
> The example can be found from vfio driver in drivers/vfio/pci/vfio_pci.c:
>    vfio_pci_mmap:       VM_PFNMAP is set for the vma
>    vfio_pci_mmap_fault: remap_pfn_range() is called

Right. I have discussed the above with Marc. I want to find the driver
so that it can be fixed. However, AFAICS, there is no driver that
matches the description...

Thanks,
Keqian

> Thanks,
> Gavin
>
[...]
Hi Keqian and Santosh,

On 4/21/21 12:59 PM, Keqian Zhu wrote:
> On 2020/10/22 0:16, Santosh Shukla wrote:
[...]
> Could you give the name of the mdev vendor driver that triggers this issue?
> I failed to find one according to your description. Thanks.
>

I think this would be fixed on the driver side, by setting VM_PFNMAP in
its mmap() callback (call_mmap()), like the vfio PCI driver does. That
way it isn't delayed until a page fault is issued and remap_pfn_range()
is called. It's determined from the beginning that the vma associated
with the mdev vendor driver serves a PFN remapping purpose, so the vma
should be populated completely, including the VM_PFNMAP flag, before it
becomes visible to user space.

The example can be found in the vfio driver, drivers/vfio/pci/vfio_pci.c:
   vfio_pci_mmap:       VM_PFNMAP is set for the vma
   vfio_pci_mmap_fault: remap_pfn_range() is called

Thanks,
Gavin

[...]
On Wed, 21 Apr 2021 07:17:44 +0100,
Keqian Zhu <zhukeqian1@huawei.com> wrote:
>
> Hi Gavin,
>
> On 2021/4/21 14:20, Gavin Shan wrote:
[...]
> > I think it would be fixed in driver side to set VM_PFNMAP in
> > its mmap() callback (call_mmap()), like vfio PCI driver does.
> > It means it won't be delayed until page fault is issued and
> > remap_pfn_range() is called. It's determined from the beginning
> > that the vma associated the mdev vendor driver is serving as
> > PFN remapping purpose. So the vma should be populated completely,
> > including the VM_PFNMAP flag before it becomes visible to user
> > space.

Why should that be a requirement? Lazy populating of the VMA should be
perfectly acceptable if the fault can only happen on the CPU side.

> > The example can be found from vfio driver in drivers/vfio/pci/vfio_pci.c:
> >    vfio_pci_mmap:       VM_PFNMAP is set for the vma
> >    vfio_pci_mmap_fault: remap_pfn_range() is called
> Right. I have discussed the above with Marc. I want to find the driver
> to fix it. However, AFAICS, there is no driver matches the description...

I have the feeling this is an out-of-tree driver (and Santosh's email
address is bouncing, so I guess we won't have much information from
him).

However, the simple fact that any odd driver can provide a fault
handler and populate the VMA on demand makes me think that we need
to support this case.

        M.
Hi Marc,

On 4/21/21 9:59 PM, Marc Zyngier wrote:
> On Wed, 21 Apr 2021 07:17:44 +0100,
> Keqian Zhu <zhukeqian1@huawei.com> wrote:
[...]
>
> Why should that be a requirement? Lazy populating of the VMA should be
> perfectly acceptable if the fault can only happen on the CPU side.
>

It isn't a requirement, and drivers needn't follow it strictly. I
checked several drivers before looking into the patch and found that
almost all of them have VM_PFNMAP set at mmap() time. In
drivers/vfio/vfio-pci.c, there is a comment as below, but it doesn't
reveal too much about why we can't set VM_PFNMAP at fault time.

static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
{
        :
        /*
         * See remap_pfn_range(), called from vfio_pci_fault() but we can't
         * change vm_flags within the fault handler.  Set them now.
         */
        vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
        vma->vm_ops = &vfio_pci_mmap_ops;

        return 0;
}

Setting these flags in advance does have advantages. For example,
VM_DONTEXPAND prevents the vma from being merged with another one.
VM_DONTDUMP makes this vma ineligible for coredump; otherwise, the
address space associated with the vma is accessed and unnecessary page
faults are triggered on coredump. VM_IO and VM_PFNMAP avoid walking the
page frames associated with the vma, since we don't have valid PFNs in
the mapping.

>>> The example can be found from vfio driver in drivers/vfio/pci/vfio_pci.c:
>>>    vfio_pci_mmap:       VM_PFNMAP is set for the vma
>>>    vfio_pci_mmap_fault: remap_pfn_range() is called
>> Right. I have discussed the above with Marc. I want to find the driver
>> to fix it. However, AFAICS, there is no driver matches the description...
>
> I have the feeling this is an out-of-tree driver (and Santosh email
> address is bouncing, so I guess we won't have much information from
> him).
>
> However, the simple fact that any odd driver can provide a fault
> handler and populate the VMA on demand makes me think that we need
> to support this case.
>

Yeah, Santosh's email isn't reachable. I think VM_PFNMAP needs to be
rechecked after gup if we're going to support this case.

Thanks,
Gavin
On Thu, 22 Apr 2021 03:02:00 +0100,
Gavin Shan <gshan@redhat.com> wrote:
>
> Hi Marc,
>
> On 4/21/21 9:59 PM, Marc Zyngier wrote:
[...]
> > Why should that be a requirement? Lazy populating of the VMA should be
> > perfectly acceptable if the fault can only happen on the CPU side.
> >
>
> It isn't a requirement and the drivers needn't follow strictly. I checked
> several drivers before looking into the patch and found almost all the
> drivers have VM_PFNMAP set at mmap() time. In drivers/vfio/vfio-pci.c,
> there is a comment as below, but it doesn't reveal too much about why
> we can't set VM_PFNMAP at fault time.
>
> static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> {
>         :
>         /*
>          * See remap_pfn_range(), called from vfio_pci_fault() but we can't
>          * change vm_flags within the fault handler.  Set them now.
>          */
>         vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
>         vma->vm_ops = &vfio_pci_mmap_ops;
>
>         return 0;
> }
>
> To set these flags in advance does have advantages. [...]

But PCI clearly isn't the case we are dealing with here, and not
everything is VFIO either. I can *today* create a driver that
implements a mmap+fault handler, call mmap() on it, pass the result to
a memslot, and get to the exact same result Santosh describes.

No PCI, no VFIO, just a random driver. We are *required* to handle
that.

        M.
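Such a "random driver" could look like the sketch below. This is
kernel-side code that cannot be built or run here, and the `demo_*`
names and `demo_mmio_base` are hypothetical; it only illustrates the
shape of driver Marc describes, where nothing sets VM_PFNMAP at mmap()
time and the first CPU-side fault performs the remap.

```c
/* Hypothetical out-of-tree driver: the vma KVM inspects in
 * user_mem_abort() still looks like ordinary memory until the first
 * CPU-side fault runs this handler. */
static vm_fault_t demo_mmio_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;

	/* remap_pfn_range() implicitly sets VM_PFNMAP (plus VM_IO,
	 * VM_DONTEXPAND, VM_DONTDUMP) on the vma. */
	if (remap_pfn_range(vma, vma->vm_start,
			    demo_mmio_base >> PAGE_SHIFT, /* hypothetical MMIO base */
			    vma->vm_end - vma->vm_start,
			    vma->vm_page_prot))
		return VM_FAULT_SIGBUS;

	return VM_FAULT_NOPAGE;
}

static const struct vm_operations_struct demo_vm_ops = {
	.fault = demo_mmio_fault,
};

static int demo_mmap(struct file *file, struct vm_area_struct *vma)
{
	/* Deliberately no VM_PFNMAP here. */
	vma->vm_ops = &demo_vm_ops;
	return 0;
}
```

Handing the mmap()ed range to KVM_SET_USER_MEMORY_REGION before any
CPU access would then reproduce the stale-flags situation from the
patch description.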
On 4/22/2021 12:20 PM, Marc Zyngier wrote:
> On Thu, 22 Apr 2021 03:02:00 +0100,
> Gavin Shan <gshan@redhat.com> wrote:
[...]
> But PCI clearly isn't the case we are dealing with here, and not
> everything is VFIO either. I can *today* create a driver that
> implements a mmap+fault handler, call mmap() on it, pass the result to
> a memslot, and get to the exact same result Santosh describes.
>
> No PCI, no VFIO, just a random driver. We are *required* to handle
> that.

Agree with Marc here, that the kernel should be able to handle it
without the VM_PFNMAP flag being set by the driver.

For driver reference, you could check the V2 version of this patch that
got accepted upstream and has details as to how this can be reproduced
using vfio-pci: https://www.spinics.net/lists/arm-kernel/msg848491.html

> M.
>
> --
> Without deviation from the norm, progress is not possible.
>
On Thu, Apr 22, 2021 at 1:07 PM Tarun Gupta (SW-GPU)
<targupta@nvidia.com> wrote:
>
> On 4/22/2021 12:20 PM, Marc Zyngier wrote:
[...]
> >>> Why should that be a requirement? Lazy populating of the VMA should be
> >>> perfectly acceptable if the fault can only happen on the CPU side.
> >>>

Right.

Hi Keqian,
You can refer to this case:
http://lkml.iu.edu/hypermail/linux/kernel/2010.3/00952.html

(Sorry guys, I am not with Nvidia, but this is my quick input.)

[...]
> > But PCI clearly isn't the case we are dealing with here, and not
> > everything is VFIO either. I can *today* create a driver that
> > implements a mmap+fault handler, call mmap() on it, pass the result to
> > a memslot, and get to the exact same result Santosh describes.
> >
> > No PCI, no VFIO, just a random driver. We are *required* to handle
> > that.
>
> Agree with Marc here, that kernel should be able to handle it without
> VM_PFNMAP flag set in driver.
>
> For driver reference, you could check the V2 version of this patch that
> got accepted upstream and has details as-to how this can be reproduced
> using vfio-pci: https://www.spinics.net/lists/arm-kernel/msg848491.html
>
> > M.
On 2021/4/22 16:00, Santosh Shukla wrote:
> On Thu, Apr 22, 2021 at 1:07 PM Tarun Gupta (SW-GPU)
> <targupta@nvidia.com> wrote:
>>
>>
>>
>> On 4/22/2021 12:20 PM, Marc Zyngier wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> On Thu, 22 Apr 2021 03:02:00 +0100,
>>> Gavin Shan <gshan@redhat.com> wrote:
>>>>
>>>> Hi Marc,
>>>>
>>>> On 4/21/21 9:59 PM, Marc Zyngier wrote:
>>>>> On Wed, 21 Apr 2021 07:17:44 +0100,
>>>>> Keqian Zhu <zhukeqian1@huawei.com> wrote:
>>>>>> On 2021/4/21 14:20, Gavin Shan wrote:
>>>>>>> On 4/21/21 12:59 PM, Keqian Zhu wrote:
>>>>>>>> On 2020/10/22 0:16, Santosh Shukla wrote:
>>>>>>>>> The Commit:6d674e28 introduces a notion to detect and handle the
>>>>>>>>> device mapping. The commit checks for the VM_PFNMAP flag is set
>>>>>>>>> in vma->flags and if set then marks force_pte to true such that
>>>>>>>>> if force_pte is true then ignore the THP function check
>>>>>>>>> (/transparent_hugepage_adjust()).
>>>>>>>>>
>>>>>>>>> There could be an issue with the VM_PFNMAP flag setting and checking.
>>>>>>>>> For example consider a case where the mdev vendor driver register's
>>>>>>>>> the vma_fault handler named vma_mmio_fault(), which maps the
>>>>>>>>> host MMIO region in-turn calls remap_pfn_range() and maps
>>>>>>>>> the MMIO's vma space. Where, remap_pfn_range implicitly sets
>>>>>>>>> the VM_PFNMAP flag into vma->flags.
>>>>>>>> Could you give the name of the mdev vendor driver that triggers this issue?
>>>>>>>> I failed to find one according to your description. Thanks.
>>>>>>>>
>>>>>>>
>>>>>>> I think it would be fixed in driver side to set VM_PFNMAP in
>>>>>>> its mmap() callback (call_mmap()), like vfio PCI driver does.
>>>>>>> It means it won't be delayed until page fault is issued and
>>>>>>> remap_pfn_range() is called. It's determined from the beginning
>>>>>>> that the vma associated the mdev vendor driver is serving as
>>>>>>> PFN remapping purpose. So the vma should be populated completely,
>>>>>>> including the VM_PFNMAP flag before it becomes visible to user
>>>>>>> space.
>>>>>
>>>>> Why should that be a requirement? Lazy populating of the VMA should be
>>>>> perfectly acceptable if the fault can only happen on the CPU side.
>>>>>
>
> Right.
> Hi keqian,
> You can refer to case
> http://lkml.iu.edu/hypermail/linux/kernel/2010.3/00952.html

Hi Santosh,

Yeah, thanks for that.

BRs,
Keqian
Hi Marc,

On 4/22/21 4:50 PM, Marc Zyngier wrote:
> On Thu, 22 Apr 2021 03:02:00 +0100,
> Gavin Shan <gshan@redhat.com> wrote:
>> On 4/21/21 9:59 PM, Marc Zyngier wrote:
>>> On Wed, 21 Apr 2021 07:17:44 +0100,
>>> Keqian Zhu <zhukeqian1@huawei.com> wrote:
>>>> On 2021/4/21 14:20, Gavin Shan wrote:
>>>>> On 4/21/21 12:59 PM, Keqian Zhu wrote:
>>>>>> On 2020/10/22 0:16, Santosh Shukla wrote:
>>>>>>> The Commit:6d674e28 introduces a notion to detect and handle the
>>>>>>> device mapping. The commit checks for the VM_PFNMAP flag is set
>>>>>>> in vma->flags and if set then marks force_pte to true such that
>>>>>>> if force_pte is true then ignore the THP function check
>>>>>>> (/transparent_hugepage_adjust()).
>>>>>>>
>>>>>>> There could be an issue with the VM_PFNMAP flag setting and checking.
>>>>>>> For example consider a case where the mdev vendor driver register's
>>>>>>> the vma_fault handler named vma_mmio_fault(), which maps the
>>>>>>> host MMIO region in-turn calls remap_pfn_range() and maps
>>>>>>> the MMIO's vma space. Where, remap_pfn_range implicitly sets
>>>>>>> the VM_PFNMAP flag into vma->flags.
>>>>>> Could you give the name of the mdev vendor driver that triggers this issue?
>>>>>> I failed to find one according to your description. Thanks.
>>>>>>
>>>>>
>>>>> I think it would be fixed in driver side to set VM_PFNMAP in
>>>>> its mmap() callback (call_mmap()), like vfio PCI driver does.
>>>>> It means it won't be delayed until page fault is issued and
>>>>> remap_pfn_range() is called. It's determined from the beginning
>>>>> that the vma associated the mdev vendor driver is serving as
>>>>> PFN remapping purpose. So the vma should be populated completely,
>>>>> including the VM_PFNMAP flag before it becomes visible to user
>>>>> space.
>>>
>>> Why should that be a requirement? Lazy populating of the VMA should be
>>> perfectly acceptable if the fault can only happen on the CPU side.
>>>
>>
>> It isn't a requirement and the drivers needn't follow strictly. I checked
>> several drivers before looking into the patch and found almost all the
>> drivers have VM_PFNMAP set at mmap() time. In drivers/vfio/vfio-pci.c,
>> there is a comment as below, but it doesn't reveal too much about why
>> we can't set VM_PFNMAP at fault time.
>>
>> static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>> {
>>         :
>>         /*
>>          * See remap_pfn_range(), called from vfio_pci_fault() but we can't
>>          * change vm_flags within the fault handler.  Set them now.
>>          */
>>         vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
>>         vma->vm_ops = &vfio_pci_mmap_ops;
>>
>>         return 0;
>> }
>>
>> To set these flags in advance does have advantages. For example,
>> VM_DONTEXPAND prevents the vma to be merged with another
>> one. VM_DONTDUMP make this vma isn't eligible for
>> coredump. Otherwise, the address space, which is associated with the
>> vma is accessed and unnecessary page faults are triggered on
>> coredump. VM_IO and VM_PFNMAP avoids to walk the page frames
>> associated with the vma since we don't have valid PFN in the
>> mapping.
>
> But PCI clearly isn't the case we are dealing with here, and not
> everything is VFIO either. I can *today* create a driver that
> implements a mmap+fault handler, call mmap() on it, pass the result to
> a memslot, and get to the exact same result Santosh describes.
>
> No PCI, no VFIO, just a random driver. We are *required* to handle
> that.
>

hmm, ok. I was thinking it's related to VFIO mdev driver when Santosh
was talking about "mdev driver". Anyway, it's always nice to support
the case :)

Thanks,
Gavin
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 3d26b47..ff15357 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1947,7 +1947,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 * If we are not forced to use page mapping, check if we are
 	 * backed by a THP and thus use block mapping if possible.
 	 */
-	if (vma_pagesize == PAGE_SIZE && !force_pte)
+	if (vma_pagesize == PAGE_SIZE && !force_pte && !is_iomap(flags))
 		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
 							   &pfn, &fault_ipa);
 	if (writable)