Message ID: 1603297010-18787-1-git-send-email-sashukla@nvidia.com (mailing list archive)
State:      New, archived
Series:     KVM: arm64: Correctly handle the mmio faulting
Hi Santosh,

Thanks for this.

On 2020-10-21 17:16, Santosh Shukla wrote:
> Commit 6d674e28 introduces a notion to detect and handle device
> mappings. The commit checks whether the VM_PFNMAP flag is set in
> vma->flags and, if so, marks force_pte as true, so that the THP check
> (transparent_hugepage_adjust()) is skipped.
>
> There could be an issue with how the VM_PFNMAP flag is set and checked.
> For example, consider a case where the mdev vendor driver registers a
> vma_fault handler named vma_mmio_fault(), which maps the host MMIO
> region by calling remap_pfn_range() on the MMIO's vma space.
> remap_pfn_range() implicitly sets the VM_PFNMAP flag in vma->flags.
>
> Now assume an mmio fault handling flow where the guest first accesses
> the MMIO region whose stage-2 translation is not yet present. That
> results in the arm64 KVM hypervisor executing the guest abort handler,
> like below:
>
> kvm_handle_guest_abort() -->
>   user_mem_abort() --> {
>     ...
>     0. Checks vma->flags for VM_PFNMAP.
>     1. Since VM_PFNMAP is not yet set, force_pte _is_ false;
>     2. gfn_to_pfn_prot() -->
>          __gfn_to_pfn_memslot() -->
>            fixup_user_fault() -->
>              handle_mm_fault() -->
>                __do_fault() -->
>                  vma_mmio_fault() -->  // vendor's mdev fault handler
>                    remap_pfn_range()   // Here VM_PFNMAP gets set
>                                        // in vma->flags.
>     3. Since force_pte was left false in step 1, we go on to
>        execute transparent_hugepage_adjust(), and that leads
>        to the Oops [4].
>   }

Hmmm. Nice. Any chance you could provide us with an actual reproducer?

> The proposition is to check the is_iomap flag before executing the THP
> function transparent_hugepage_adjust().
>
> [4] THP Oops:
>> pc: kvm_is_transparent_hugepage+0x18/0xb0
>> ...
>> ...
>> user_mem_abort+0x340/0x9b8
>> kvm_handle_guest_abort+0x248/0x468
>> handle_exit+0x150/0x1b0
>> kvm_arch_vcpu_ioctl_run+0x4d4/0x778
>> kvm_vcpu_ioctl+0x3c0/0x858
>> ksys_ioctl+0x84/0xb8
>> __arm64_sys_ioctl+0x28/0x38
>
> Tested on a Huawei Kunpeng TaiShan-200 arm64 server, using a VFIO-mdev
> device.
> Linux tip: 583090b1
>
> Fixes: 6d674e28 ("KVM: arm/arm64: Properly handle faulting of device
> mappings")
> Signed-off-by: Santosh Shukla <sashukla@nvidia.com>
> ---
>  arch/arm64/kvm/mmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 3d26b47..ff15357 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1947,7 +1947,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	 * If we are not forced to use page mapping, check if we are
>  	 * backed by a THP and thus use block mapping if possible.
>  	 */
> -	if (vma_pagesize == PAGE_SIZE && !force_pte)
> +	if (vma_pagesize == PAGE_SIZE && !force_pte && !is_iomap(flags))
>  		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
>  							   &pfn, &fault_ipa);
>  	if (writable)

Why don't you directly set force_pte to true at the point where we
update the flags? It certainly would be a bit more readable:

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 3d26b47a1343..7a4ad984d54e 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1920,6 +1920,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (kvm_is_device_pfn(pfn)) {
 		mem_type = PAGE_S2_DEVICE;
 		flags |= KVM_S2PTE_FLAG_IS_IOMAP;
+		force_pte = true;
 	} else if (logging_active) {
 		/*
 		 * Faults on pages in a memslot with logging enabled

and almost directly applies to what we have queued for 5.10.

Thanks,

        M.
Hi Santosh,

On 2020/10/22 0:16, Santosh Shukla wrote:
> The Commit:6d674e28 introduces a notion to detect and handle the
> device mapping. The commit checks for the VM_PFNMAP flag is set
> in vma->flags and if set then marks force_pte to true such that
> if force_pte is true then ignore the THP function check
> (/transparent_hugepage_adjust()).
>
> There could be an issue with the VM_PFNMAP flag setting and checking.
> For example consider a case where the mdev vendor driver register's
> the vma_fault handler named vma_mmio_fault(), which maps the
> host MMIO region in-turn calls remap_pfn_range() and maps
> the MMIO's vma space. Where, remap_pfn_range implicitly sets
> the VM_PFNMAP flag into vma->flags.

Could you give the name of the mdev vendor driver that triggers this issue?
I failed to find one according to your description. Thanks.

BRs,
Keqian

[...]
Hi Gavin,

On 2021/4/21 14:20, Gavin Shan wrote:
> Hi Keqian and Santosh,
>
> On 4/21/21 12:59 PM, Keqian Zhu wrote:
>> On 2020/10/22 0:16, Santosh Shukla wrote:
[...]
>> Could you give the name of the mdev vendor driver that triggers this issue?
>> I failed to find one according to your description. Thanks.
>>
>
> I think it would be fixed in driver side to set VM_PFNMAP in
> its mmap() callback (call_mmap()), like vfio PCI driver does.
> It means it won't be delayed until page fault is issued and
> remap_pfn_range() is called. It's determined from the beginning
> that the vma associated the mdev vendor driver is serving as
> PFN remapping purpose. So the vma should be populated completely,
> including the VM_PFNMAP flag before it becomes visible to user
> space.
>
> The example can be found from vfio driver in drivers/vfio/pci/vfio_pci.c:
>    vfio_pci_mmap:       VM_PFNMAP is set for the vma
>    vfio_pci_mmap_fault: remap_pfn_range() is called

Right. I have discussed the above with Marc. I want to find the driver
so that it can be fixed. However, AFAICS, there is no driver that
matches the description...

Thanks,
Keqian

> Thanks,
> Gavin
>
[...]
Hi Keqian and Santosh,

On 4/21/21 12:59 PM, Keqian Zhu wrote:
> On 2020/10/22 0:16, Santosh Shukla wrote:
[...]
> Could you give the name of the mdev vendor driver that triggers this issue?
> I failed to find one according to your description. Thanks.
>

I think this would be fixed on the driver side, by setting VM_PFNMAP in
its mmap() callback (call_mmap()), like the vfio PCI driver does. That
way it isn't delayed until a page fault is issued and remap_pfn_range()
is called. It's determined from the beginning that the vma associated
with the mdev vendor driver serves a PFN remapping purpose, so the vma
should be populated completely, including the VM_PFNMAP flag, before it
becomes visible to user space.

The example can be found in the vfio driver, drivers/vfio/pci/vfio_pci.c:
   vfio_pci_mmap:       VM_PFNMAP is set for the vma
   vfio_pci_mmap_fault: remap_pfn_range() is called

Thanks,
Gavin

[...]
On Wed, 21 Apr 2021 07:17:44 +0100,
Keqian Zhu <zhukeqian1@huawei.com> wrote:
>
> Hi Gavin,
>
> On 2021/4/21 14:20, Gavin Shan wrote:
[...]
> > I think it would be fixed in driver side to set VM_PFNMAP in
> > its mmap() callback (call_mmap()), like vfio PCI driver does.
> > It means it won't be delayed until page fault is issued and
> > remap_pfn_range() is called. It's determined from the beginning
> > that the vma associated the mdev vendor driver is serving as
> > PFN remapping purpose. So the vma should be populated completely,
> > including the VM_PFNMAP flag before it becomes visible to user
> > space.

Why should that be a requirement? Lazy populating of the VMA should be
perfectly acceptable if the fault can only happen on the CPU side.

> > The example can be found from vfio driver in drivers/vfio/pci/vfio_pci.c:
> >    vfio_pci_mmap:       VM_PFNMAP is set for the vma
> >    vfio_pci_mmap_fault: remap_pfn_range() is called
> Right. I have discussed the above with Marc. I want to find the driver
> to fix it. However, AFAICS, there is no driver matches the description...

I have the feeling this is an out-of-tree driver (and Santosh's email
address is bouncing, so I guess we won't have much information from
him).

However, the simple fact that any odd driver can provide a fault
handler and populate the VMA on demand makes me think that we need
to support this case.

        M.
Hi Marc,

On 4/21/21 9:59 PM, Marc Zyngier wrote:
> On Wed, 21 Apr 2021 07:17:44 +0100,
> Keqian Zhu <zhukeqian1@huawei.com> wrote:
[...]
>
> Why should that be a requirement? Lazy populating of the VMA should be
> perfectly acceptable if the fault can only happen on the CPU side.
>

It isn't a requirement, and drivers needn't follow it strictly. I
checked several drivers before looking into the patch and found that
almost all of them have VM_PFNMAP set at mmap() time. In
drivers/vfio/vfio-pci.c, there is a comment as below, but it doesn't
reveal too much about why we can't set VM_PFNMAP at fault time.

static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
{
        :
        /*
         * See remap_pfn_range(), called from vfio_pci_fault() but we can't
         * change vm_flags within the fault handler.  Set them now.
         */
        vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
        vma->vm_ops = &vfio_pci_mmap_ops;

        return 0;
}

Setting these flags in advance does have advantages. For example,
VM_DONTEXPAND prevents the vma from being merged with another one.
VM_DONTDUMP makes this vma ineligible for coredump; otherwise, the
address space associated with the vma is accessed and unnecessary page
faults are triggered on coredump. VM_IO and VM_PFNMAP avoid walking the
page frames associated with the vma, since we don't have valid PFNs in
the mapping.

>>> The example can be found from vfio driver in drivers/vfio/pci/vfio_pci.c:
>>>    vfio_pci_mmap:       VM_PFNMAP is set for the vma
>>>    vfio_pci_mmap_fault: remap_pfn_range() is called
>> Right. I have discussed the above with Marc. I want to find the driver
>> to fix it. However, AFAICS, there is no driver matches the description...
>
> I have the feeling this is an out-of-tree driver (and Santosh email
> address is bouncing, so I guess we won't have much information from
> him).
>
> However, the simple fact that any odd driver can provide a fault
> handler and populate the VMA on demand makes me think that we need
> to support this case.
>

Yeah, Santosh's email isn't reachable. I think VM_PFNMAP needs to be
rechecked after gup if we're going to support this case.

Thanks,
Gavin
On Thu, 22 Apr 2021 03:02:00 +0100,
Gavin Shan <gshan@redhat.com> wrote:
>
> Hi Marc,
>
> On 4/21/21 9:59 PM, Marc Zyngier wrote:
[...]
> > Why should that be a requirement? Lazy populating of the VMA should be
> > perfectly acceptable if the fault can only happen on the CPU side.
> >
>
> It isn't a requirement and the drivers needn't follow strictly. I checked
> several drivers before looking into the patch and found almost all the
> drivers have VM_PFNMAP set at mmap() time. In drivers/vfio/vfio-pci.c,
> there is a comment as below, but it doesn't reveal too much about why
> we can't set VM_PFNMAP at fault time.
>
> static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> {
>         :
>         /*
>          * See remap_pfn_range(), called from vfio_pci_fault() but we can't
>          * change vm_flags within the fault handler.  Set them now.
>          */
>         vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
>         vma->vm_ops = &vfio_pci_mmap_ops;
>
>         return 0;
> }
>
> To set these flags in advance does have advantages. [...]

But PCI clearly isn't the case we are dealing with here, and not
everything is VFIO either. I can *today* create a driver that
implements a mmap+fault handler, call mmap() on it, pass the result to
a memslot, and get to the exact same result Santosh describes.

No PCI, no VFIO, just a random driver. We are *required* to handle
that.

        M.
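Such a "random driver" could look like the sketch below. This is
kernel-side code that cannot be built or run here, and the `demo_*`
names and `demo_mmio_base` are hypothetical; it only illustrates the
shape of driver Marc describes, where nothing sets VM_PFNMAP at mmap()
time and the first CPU-side fault performs the remap.

```c
/* Hypothetical out-of-tree driver: the vma KVM inspects in
 * user_mem_abort() still looks like ordinary memory until the first
 * CPU-side fault runs this handler. */
static vm_fault_t demo_mmio_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;

	/* remap_pfn_range() implicitly sets VM_PFNMAP (plus VM_IO,
	 * VM_DONTEXPAND, VM_DONTDUMP) on the vma. */
	if (remap_pfn_range(vma, vma->vm_start,
			    demo_mmio_base >> PAGE_SHIFT, /* hypothetical MMIO base */
			    vma->vm_end - vma->vm_start,
			    vma->vm_page_prot))
		return VM_FAULT_SIGBUS;

	return VM_FAULT_NOPAGE;
}

static const struct vm_operations_struct demo_vm_ops = {
	.fault = demo_mmio_fault,
};

static int demo_mmap(struct file *file, struct vm_area_struct *vma)
{
	/* Deliberately no VM_PFNMAP here. */
	vma->vm_ops = &demo_vm_ops;
	return 0;
}
```

Handing the mmap()ed range to KVM_SET_USER_MEMORY_REGION before any
CPU access would then reproduce the stale-flags situation from the
patch description.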
On 4/22/2021 12:20 PM, Marc Zyngier wrote:
> On Thu, 22 Apr 2021 03:02:00 +0100,
> Gavin Shan <gshan@redhat.com> wrote:
[...]
> But PCI clearly isn't the case we are dealing with here, and not
> everything is VFIO either. I can *today* create a driver that
> implements a mmap+fault handler, call mmap() on it, pass the result to
> a memslot, and get to the exact same result Santosh describes.
>
> No PCI, no VFIO, just a random driver. We are *required* to handle
> that.

Agree with Marc here, that the kernel should be able to handle it
without the VM_PFNMAP flag being set by the driver.

For driver reference, you could check the V2 version of this patch that
got accepted upstream and has details as to how this can be reproduced
using vfio-pci: https://www.spinics.net/lists/arm-kernel/msg848491.html

> M.
>
> --
> Without deviation from the norm, progress is not possible.
>
On Thu, Apr 22, 2021 at 1:07 PM Tarun Gupta (SW-GPU)
<targupta@nvidia.com> wrote:
>
> On 4/22/2021 12:20 PM, Marc Zyngier wrote:
[...]
> >>> Why should that be a requirement? Lazy populating of the VMA should be
> >>> perfectly acceptable if the fault can only happen on the CPU side.
> >>>

Right.

Hi Keqian,
You can refer to this case:
http://lkml.iu.edu/hypermail/linux/kernel/2010.3/00952.html

(Sorry guys, I am not with Nvidia, but this is my quick input.)

[...]
> > But PCI clearly isn't the case we are dealing with here, and not
> > everything is VFIO either. I can *today* create a driver that
> > implements a mmap+fault handler, call mmap() on it, pass the result to
> > a memslot, and get to the exact same result Santosh describes.
> >
> > No PCI, no VFIO, just a random driver. We are *required* to handle
> > that.
>
> Agree with Marc here, that kernel should be able to handle it without
> VM_PFNMAP flag set in driver.
>
> For driver reference, you could check the V2 version of this patch that
> got accepted upstream and has details as-to how this can be reproduced
> using vfio-pci: https://www.spinics.net/lists/arm-kernel/msg848491.html
>
> > M.
On 2021/4/22 16:00, Santosh Shukla wrote:
> On Thu, Apr 22, 2021 at 1:07 PM Tarun Gupta (SW-GPU)
> <targupta@nvidia.com> wrote:
>>
>>
>>
>> On 4/22/2021 12:20 PM, Marc Zyngier wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> On Thu, 22 Apr 2021 03:02:00 +0100,
>>> Gavin Shan <gshan@redhat.com> wrote:
>>>>
>>>> Hi Marc,
>>>>
>>>> On 4/21/21 9:59 PM, Marc Zyngier wrote:
>>>>> On Wed, 21 Apr 2021 07:17:44 +0100,
>>>>> Keqian Zhu <zhukeqian1@huawei.com> wrote:
>>>>>> On 2021/4/21 14:20, Gavin Shan wrote:
>>>>>>> On 4/21/21 12:59 PM, Keqian Zhu wrote:
>>>>>>>> On 2020/10/22 0:16, Santosh Shukla wrote:
>>>>>>>>> The Commit:6d674e28 introduces a notion to detect and handle the
>>>>>>>>> device mapping. The commit checks for the VM_PFNMAP flag is set
>>>>>>>>> in vma->flags and if set then marks force_pte to true such that
>>>>>>>>> if force_pte is true then ignore the THP function check
>>>>>>>>> (/transparent_hugepage_adjust()).
>>>>>>>>>
>>>>>>>>> There could be an issue with the VM_PFNMAP flag setting and checking.
>>>>>>>>> For example consider a case where the mdev vendor driver register's
>>>>>>>>> the vma_fault handler named vma_mmio_fault(), which maps the
>>>>>>>>> host MMIO region in-turn calls remap_pfn_range() and maps
>>>>>>>>> the MMIO's vma space. Where, remap_pfn_range implicitly sets
>>>>>>>>> the VM_PFNMAP flag into vma->flags.
>>>>>>>> Could you give the name of the mdev vendor driver that triggers this issue?
>>>>>>>> I failed to find one according to your description. Thanks.
>>>>>>>>
>>>>>>>
>>>>>>> I think it would be fixed in driver side to set VM_PFNMAP in
>>>>>>> its mmap() callback (call_mmap()), like vfio PCI driver does.
>>>>>>> It means it won't be delayed until page fault is issued and
>>>>>>> remap_pfn_range() is called. It's determined from the beginning
>>>>>>> that the vma associated the mdev vendor driver is serving as
>>>>>>> PFN remapping purpose. So the vma should be populated completely,
>>>>>>> including the VM_PFNMAP flag before it becomes visible to user
>>>>>>> space.
>>>>>
>>>>> Why should that be a requirement? Lazy populating of the VMA should be
>>>>> perfectly acceptable if the fault can only happen on the CPU side.
>>>>>
>
> Right.
> Hi keqian,
> You can refer to case
> http://lkml.iu.edu/hypermail/linux/kernel/2010.3/00952.html

Hi Santosh,

Yeah, thanks for that.

BRs,
Keqian
Hi Marc,

On 4/22/21 4:50 PM, Marc Zyngier wrote:
> On Thu, 22 Apr 2021 03:02:00 +0100,
> Gavin Shan <gshan@redhat.com> wrote:
>> On 4/21/21 9:59 PM, Marc Zyngier wrote:
>>> On Wed, 21 Apr 2021 07:17:44 +0100,
>>> Keqian Zhu <zhukeqian1@huawei.com> wrote:
>>>> On 2021/4/21 14:20, Gavin Shan wrote:
>>>>> On 4/21/21 12:59 PM, Keqian Zhu wrote:
>>>>>> On 2020/10/22 0:16, Santosh Shukla wrote:
>>>>>>> The Commit:6d674e28 introduces a notion to detect and handle the
>>>>>>> device mapping. The commit checks for the VM_PFNMAP flag is set
>>>>>>> in vma->flags and if set then marks force_pte to true such that
>>>>>>> if force_pte is true then ignore the THP function check
>>>>>>> (/transparent_hugepage_adjust()).
>>>>>>>
>>>>>>> There could be an issue with the VM_PFNMAP flag setting and checking.
>>>>>>> For example consider a case where the mdev vendor driver register's
>>>>>>> the vma_fault handler named vma_mmio_fault(), which maps the
>>>>>>> host MMIO region in-turn calls remap_pfn_range() and maps
>>>>>>> the MMIO's vma space. Where, remap_pfn_range implicitly sets
>>>>>>> the VM_PFNMAP flag into vma->flags.
>>>>>> Could you give the name of the mdev vendor driver that triggers this issue?
>>>>>> I failed to find one according to your description. Thanks.
>>>>>>
>>>>>
>>>>> I think it would be fixed in driver side to set VM_PFNMAP in
>>>>> its mmap() callback (call_mmap()), like vfio PCI driver does.
>>>>> It means it won't be delayed until page fault is issued and
>>>>> remap_pfn_range() is called. It's determined from the beginning
>>>>> that the vma associated the mdev vendor driver is serving as
>>>>> PFN remapping purpose. So the vma should be populated completely,
>>>>> including the VM_PFNMAP flag before it becomes visible to user
>>>>> space.
>>>
>>> Why should that be a requirement? Lazy populating of the VMA should be
>>> perfectly acceptable if the fault can only happen on the CPU side.
>>>
>>
>> It isn't a requirement and the drivers needn't follow strictly. I checked
>> several drivers before looking into the patch and found almost all the
>> drivers have VM_PFNMAP set at mmap() time. In drivers/vfio/vfio-pci.c,
>> there is a comment as below, but it doesn't reveal too much about why
>> we can't set VM_PFNMAP at fault time.
>>
>> static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>> {
>>         :
>>         /*
>>          * See remap_pfn_range(), called from vfio_pci_fault() but we can't
>>          * change vm_flags within the fault handler.  Set them now.
>>          */
>>         vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
>>         vma->vm_ops = &vfio_pci_mmap_ops;
>>
>>         return 0;
>> }
>>
>> To set these flags in advance does have advantages. For example,
>> VM_DONTEXPAND prevents the vma to be merged with another
>> one. VM_DONTDUMP make this vma isn't eligible for
>> coredump. Otherwise, the address space, which is associated with the
>> vma is accessed and unnecessary page faults are triggered on
>> coredump. VM_IO and VM_PFNMAP avoids to walk the page frames
>> associated with the vma since we don't have valid PFN in the
>> mapping.
>
> But PCI clearly isn't the case we are dealing with here, and not
> everything is VFIO either. I can *today* create a driver that
> implements a mmap+fault handler, call mmap() on it, pass the result to
> a memslot, and get to the exact same result Santosh describes.
>
> No PCI, no VFIO, just a random driver. We are *required* to handle
> that.
>

hmm, ok. I was thinking it's related to VFIO mdev driver when Santosh
was talking about "mdev driver". Anyway, it's always nice to support
the case :)

Thanks,
Gavin
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 3d26b47..ff15357 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1947,7 +1947,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 * If we are not forced to use page mapping, check if we are
 	 * backed by a THP and thus use block mapping if possible.
 	 */
-	if (vma_pagesize == PAGE_SIZE && !force_pte)
+	if (vma_pagesize == PAGE_SIZE && !force_pte && !is_iomap(flags))
 		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
 							   &pfn, &fault_ipa);
 	if (writable)