
[1/2] dma-buf: Require VM_PFNMAP vma for mmap

Message ID 20210223105951.912577-1-daniel.vetter@ffwll.ch (mailing list archive)
State New, archived
Series [1/2] dma-buf: Require VM_PFNMAP vma for mmap

Commit Message

Daniel Vetter Feb. 23, 2021, 10:59 a.m. UTC
tldr; DMA buffers aren't normal memory: expecting that you can use
them like that (like calling get_user_pages works, or that they're
accounted like any other normal memory) cannot be guaranteed.

Since some userspace only runs on integrated devices, where all
buffers are actually resident system memory, there's a huge
temptation to assume that a struct page is always present and usable
like for any other pagecache-backed mmap. This has the potential to
result in a uapi nightmare.

To close this gap, require that DMA buffer mmaps are VM_PFNMAP, which
blocks get_user_pages and all the other struct page based
infrastructure for everyone. In spirit this is the uapi counterpart to
the kernel-internal CONFIG_DMABUF_DEBUG.

Motivated by a recent patch which wanted to switch the system dma-buf
heap to vm_insert_page instead of vm_insert_pfn.

v2:

Jason brought up that we also want to guarantee that all ptes have the
pte_special flag set, to catch fast get_user_pages (on architectures
that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would
still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.

From auditing the various functions to insert pfn pte entries
(vm_insert_pfn_prot, remap_pfn_range and all its callers like
dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so
this should be the correct flag to check for.

References: https://lore.kernel.org/lkml/CAKMK7uHi+mG0z0HUmNt13QCCvutuRVjpcR0NjRL12k-WbWzkRg@mail.gmail.com/
Acked-by: Christian König <christian.koenig@amd.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
---
 drivers/dma-buf/dma-buf.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)
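
For readers without the diff in front of them, the change boils down to
checking the vma flags after the exporter's mmap callback has run. A minimal
sketch of that idea (not the literal patch; helper name and exact placement
are assumptions):

    #include <linux/dma-buf.h>
    #include <linux/mm.h>

    /* Sketch only: enforce that every dma-buf mmap ends up as a pfn map,
     * so get_user_pages and the other struct page based infrastructure
     * refuse to touch these vmas.
     */
    static int dma_buf_mmap_checked(struct dma_buf *dmabuf,
                                    struct vm_area_struct *vma)
    {
            int ret;

            ret = dmabuf->ops->mmap(dmabuf, vma);
            if (ret)
                    return ret;

            /* Exporters must have set up a VM_PFNMAP (special pte) mapping. */
            if (WARN_ON(!(vma->vm_flags & VM_PFNMAP)))
                    return -EINVAL;

            return 0;
    }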

Comments

Thomas Hellström (Intel) Feb. 24, 2021, 7:46 a.m. UTC | #1
On 2/23/21 11:59 AM, Daniel Vetter wrote:
> tldr; DMA buffers aren't normal memory, expecting that you can use
> them like that (like calling get_user_pages works, or that they're
> accounting like any other normal memory) cannot be guaranteed.
>
> Since some userspace only runs on integrated devices, where all
> buffers are actually all resident system memory, there's a huge
> temptation to assume that a struct page is always present and useable
> like for any more pagecache backed mmap. This has the potential to
> result in a uapi nightmare.
>
> To stop this gap require that DMA buffer mmaps are VM_PFNMAP, which
> blocks get_user_pages and all the other struct page based
> infrastructure for everyone. In spirit this is the uapi counterpart to
> the kernel-internal CONFIG_DMABUF_DEBUG.
>
> Motivated by a recent patch which wanted to swich the system dma-buf
> heap to vm_insert_page instead of vm_insert_pfn.
>
> v2:
>
> Jason brought up that we also want to guarantee that all ptes have the
> pte_special flag set, to catch fast get_user_pages (on architectures
> that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would
> still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
>
>  From auditing the various functions to insert pfn pte entires
> (vm_insert_pfn_prot, remap_pfn_range and all it's callers like
> dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so
> this should be the correct flag to check for.
>
If we require VM_PFNMAP for ordinary page mappings, we also need to
disallow COW mappings, since those will not work on architectures that
don't have CONFIG_ARCH_HAS_PTE_SPECIAL (see the docs for vm_normal_page()).

Also worth noting is the comment in ttm_bo_mmap_vma_setup() about
possible performance implications of x86 + PAT + VM_PFNMAP + normal
pages. That's a very old comment, though, and might not be valid anymore.

/Thomas
Daniel Vetter Feb. 24, 2021, 8:45 a.m. UTC | #2
On Wed, Feb 24, 2021 at 8:46 AM Thomas Hellström (Intel)
<thomas_os@shipmail.org> wrote:
>
>
> On 2/23/21 11:59 AM, Daniel Vetter wrote:
> > tldr; DMA buffers aren't normal memory, expecting that you can use
> > them like that (like calling get_user_pages works, or that they're
> > accounting like any other normal memory) cannot be guaranteed.
> >
> > Since some userspace only runs on integrated devices, where all
> > buffers are actually all resident system memory, there's a huge
> > temptation to assume that a struct page is always present and useable
> > like for any more pagecache backed mmap. This has the potential to
> > result in a uapi nightmare.
> >
> > To stop this gap require that DMA buffer mmaps are VM_PFNMAP, which
> > blocks get_user_pages and all the other struct page based
> > infrastructure for everyone. In spirit this is the uapi counterpart to
> > the kernel-internal CONFIG_DMABUF_DEBUG.
> >
> > Motivated by a recent patch which wanted to swich the system dma-buf
> > heap to vm_insert_page instead of vm_insert_pfn.
> >
> > v2:
> >
> > Jason brought up that we also want to guarantee that all ptes have the
> > pte_special flag set, to catch fast get_user_pages (on architectures
> > that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would
> > still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
> >
> >  From auditing the various functions to insert pfn pte entires
> > (vm_insert_pfn_prot, remap_pfn_range and all it's callers like
> > dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so
> > this should be the correct flag to check for.
> >
> If we require VM_PFNMAP, for ordinary page mappings, we also need to
> disallow COW mappings, since it will not work on architectures that
> don't have CONFIG_ARCH_HAS_PTE_SPECIAL, (see the docs for vm_normal_page()).

Hm I figured everyone just uses MAP_SHARED for buffer objects since
COW really makes absolutely no sense. How would we enforce this?

> Also worth noting is the comment in  ttm_bo_mmap_vma_setup() with
> possible performance implications with x86 + PAT + VM_PFNMAP + normal
> pages. That's a very old comment, though, and might not be valid anymore.

I think that's why ttm has a page cache for these, because it indeed
sucks. The PAT changes on pages are rather expensive.

There is still an issue for iomem mappings, because the PAT validation
does a linear walk of the resource tree (lol) for every vm_insert_pfn.
But for i915 at least this is fixed by using the io_mapping
infrastructure, which does the PAT reservation only once when you set
up the mapping area at driver load.

Also TTM uses VM_PFNMAP right now for everything, so it can't be a
problem that hurts much :-)
-Daniel
Thomas Hellström (Intel) Feb. 24, 2021, 9:15 a.m. UTC | #3
On 2/24/21 9:45 AM, Daniel Vetter wrote:
> On Wed, Feb 24, 2021 at 8:46 AM Thomas Hellström (Intel)
> <thomas_os@shipmail.org> wrote:
>>
>> On 2/23/21 11:59 AM, Daniel Vetter wrote:
>>> tldr; DMA buffers aren't normal memory, expecting that you can use
>>> them like that (like calling get_user_pages works, or that they're
>>> accounting like any other normal memory) cannot be guaranteed.
>>>
>>> Since some userspace only runs on integrated devices, where all
>>> buffers are actually all resident system memory, there's a huge
>>> temptation to assume that a struct page is always present and useable
>>> like for any more pagecache backed mmap. This has the potential to
>>> result in a uapi nightmare.
>>>
>>> To stop this gap require that DMA buffer mmaps are VM_PFNMAP, which
>>> blocks get_user_pages and all the other struct page based
>>> infrastructure for everyone. In spirit this is the uapi counterpart to
>>> the kernel-internal CONFIG_DMABUF_DEBUG.
>>>
>>> Motivated by a recent patch which wanted to swich the system dma-buf
>>> heap to vm_insert_page instead of vm_insert_pfn.
>>>
>>> v2:
>>>
>>> Jason brought up that we also want to guarantee that all ptes have the
>>> pte_special flag set, to catch fast get_user_pages (on architectures
>>> that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would
>>> still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
>>>
>>>   From auditing the various functions to insert pfn pte entires
>>> (vm_insert_pfn_prot, remap_pfn_range and all it's callers like
>>> dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so
>>> this should be the correct flag to check for.
>>>
>> If we require VM_PFNMAP, for ordinary page mappings, we also need to
>> disallow COW mappings, since it will not work on architectures that
>> don't have CONFIG_ARCH_HAS_PTE_SPECIAL, (see the docs for vm_normal_page()).
> Hm I figured everyone just uses MAP_SHARED for buffer objects since
> COW really makes absolutely no sense. How would we enforce this?

Perhaps returning -EINVAL on is_cow_mapping() at mmap time. Either that 
or allowing MIXEDMAP.
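
For illustration, such a check is small; is_cow_mapping() lives in mm
internals, but its test is simple enough to open-code. A sketch (hypothetical
helper, not an actual patch):

    #include <linux/mm.h>

    /* Sketch: reject COW (MAP_PRIVATE, writable) mappings at mmap time.
     * This mirrors what is_cow_mapping() checks in mm/internal.h: a mapping
     * is COW if it is not shared but may become writable.
     */
    static int dma_buf_reject_cow(struct vm_area_struct *vma)
    {
            if ((vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE)
                    return -EINVAL;

            return 0;
    }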

>> Also worth noting is the comment in  ttm_bo_mmap_vma_setup() with
>> possible performance implications with x86 + PAT + VM_PFNMAP + normal
>> pages. That's a very old comment, though, and might not be valid anymore.
> I think that's why ttm has a page cache for these, because it indeed
> sucks. The PAT changes on pages are rather expensive.

IIRC the page cache was implemented because of the slowness of the 
caching mode transition itself, more specifically the wbinvd() call + 
global TLB flush.

>
> There is still an issue for iomem mappings, because the PAT validation
> does a linear walk of the resource tree (lol) for every vm_insert_pfn.
> But for i915 at least this is fixed by using the io_mapping
> infrastructure, which does the PAT reservation only once when you set
> up the mapping area at driver load.

Yes, I guess that was the issue that the comment describes, but the 
issue wasn't there with vm_insert_mixed() + VM_MIXEDMAP.

>
> Also TTM uses VM_PFNMAP right now for everything, so it can't be a
> problem that hurts much :-)

Hmm, both 5.11 and drm-tip appear to still use MIXEDMAP?

https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/ttm/ttm_bo_vm.c#L554

> -Daniel

/Thomas
Daniel Vetter Feb. 24, 2021, 9:31 a.m. UTC | #4
On Wed, Feb 24, 2021 at 10:16 AM Thomas Hellström (Intel)
<thomas_os@shipmail.org> wrote:
>
>
> On 2/24/21 9:45 AM, Daniel Vetter wrote:
> > On Wed, Feb 24, 2021 at 8:46 AM Thomas Hellström (Intel)
> > <thomas_os@shipmail.org> wrote:
> >>
> >> On 2/23/21 11:59 AM, Daniel Vetter wrote:
> >>> tldr; DMA buffers aren't normal memory, expecting that you can use
> >>> them like that (like calling get_user_pages works, or that they're
> >>> accounting like any other normal memory) cannot be guaranteed.
> >>>
> >>> Since some userspace only runs on integrated devices, where all
> >>> buffers are actually all resident system memory, there's a huge
> >>> temptation to assume that a struct page is always present and useable
> >>> like for any more pagecache backed mmap. This has the potential to
> >>> result in a uapi nightmare.
> >>>
> >>> To stop this gap require that DMA buffer mmaps are VM_PFNMAP, which
> >>> blocks get_user_pages and all the other struct page based
> >>> infrastructure for everyone. In spirit this is the uapi counterpart to
> >>> the kernel-internal CONFIG_DMABUF_DEBUG.
> >>>
> >>> Motivated by a recent patch which wanted to swich the system dma-buf
> >>> heap to vm_insert_page instead of vm_insert_pfn.
> >>>
> >>> v2:
> >>>
> >>> Jason brought up that we also want to guarantee that all ptes have the
> >>> pte_special flag set, to catch fast get_user_pages (on architectures
> >>> that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would
> >>> still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
> >>>
> >>>   From auditing the various functions to insert pfn pte entires
> >>> (vm_insert_pfn_prot, remap_pfn_range and all it's callers like
> >>> dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so
> >>> this should be the correct flag to check for.
> >>>
> >> If we require VM_PFNMAP, for ordinary page mappings, we also need to
> >> disallow COW mappings, since it will not work on architectures that
> >> don't have CONFIG_ARCH_HAS_PTE_SPECIAL, (see the docs for vm_normal_page()).
> > Hm I figured everyone just uses MAP_SHARED for buffer objects since
> > COW really makes absolutely no sense. How would we enforce this?
>
> Perhaps returning -EINVAL on is_cow_mapping() at mmap time. Either that
> or allowing MIXEDMAP.
>
> >> Also worth noting is the comment in  ttm_bo_mmap_vma_setup() with
> >> possible performance implications with x86 + PAT + VM_PFNMAP + normal
> >> pages. That's a very old comment, though, and might not be valid anymore.
> > I think that's why ttm has a page cache for these, because it indeed
> > sucks. The PAT changes on pages are rather expensive.
>
> IIRC the page cache was implemented because of the slowness of the
> caching mode transition itself, more specifically the wbinvd() call +
> global TLB flush.
>
> >
> > There is still an issue for iomem mappings, because the PAT validation
> > does a linear walk of the resource tree (lol) for every vm_insert_pfn.
> > But for i915 at least this is fixed by using the io_mapping
> > infrastructure, which does the PAT reservation only once when you set
> > up the mapping area at driver load.
>
> Yes, I guess that was the issue that the comment describes, but the
> issue wasn't there with vm_insert_mixed() + VM_MIXEDMAP.
>
> >
> > Also TTM uses VM_PFNMAP right now for everything, so it can't be a
> > problem that hurts much :-)
>
> Hmm, both 5.11 and drm-tip appears to still use MIXEDMAP?
>
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/ttm/ttm_bo_vm.c#L554

Uh, that's bad, because mixed maps pointing at struct page won't stop
gup. At least afaik.

Christian, do we need to patch this up, and maybe fix up ttm fault
handler to use io_mapping so the vm_insert_pfn stuff is fast?
-Daniel
Christian König Feb. 25, 2021, 10:28 a.m. UTC | #5
On 24.02.21 at 10:31, Daniel Vetter wrote:
> On Wed, Feb 24, 2021 at 10:16 AM Thomas Hellström (Intel)
> <thomas_os@shipmail.org> wrote:
>>
>> On 2/24/21 9:45 AM, Daniel Vetter wrote:
>>> On Wed, Feb 24, 2021 at 8:46 AM Thomas Hellström (Intel)
>>> <thomas_os@shipmail.org> wrote:
>>>> On 2/23/21 11:59 AM, Daniel Vetter wrote:
>>>>> tldr; DMA buffers aren't normal memory, expecting that you can use
>>>>> them like that (like calling get_user_pages works, or that they're
>>>>> accounting like any other normal memory) cannot be guaranteed.
>>>>>
>>>>> Since some userspace only runs on integrated devices, where all
>>>>> buffers are actually all resident system memory, there's a huge
>>>>> temptation to assume that a struct page is always present and useable
>>>>> like for any more pagecache backed mmap. This has the potential to
>>>>> result in a uapi nightmare.
>>>>>
>>>>> To stop this gap require that DMA buffer mmaps are VM_PFNMAP, which
>>>>> blocks get_user_pages and all the other struct page based
>>>>> infrastructure for everyone. In spirit this is the uapi counterpart to
>>>>> the kernel-internal CONFIG_DMABUF_DEBUG.
>>>>>
>>>>> Motivated by a recent patch which wanted to swich the system dma-buf
>>>>> heap to vm_insert_page instead of vm_insert_pfn.
>>>>>
>>>>> v2:
>>>>>
>>>>> Jason brought up that we also want to guarantee that all ptes have the
>>>>> pte_special flag set, to catch fast get_user_pages (on architectures
>>>>> that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would
>>>>> still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
>>>>>
>>>>>    From auditing the various functions to insert pfn pte entires
>>>>> (vm_insert_pfn_prot, remap_pfn_range and all it's callers like
>>>>> dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so
>>>>> this should be the correct flag to check for.
>>>>>
>>>> If we require VM_PFNMAP, for ordinary page mappings, we also need to
>>>> disallow COW mappings, since it will not work on architectures that
>>>> don't have CONFIG_ARCH_HAS_PTE_SPECIAL, (see the docs for vm_normal_page()).
>>> Hm I figured everyone just uses MAP_SHARED for buffer objects since
>>> COW really makes absolutely no sense. How would we enforce this?
>> Perhaps returning -EINVAL on is_cow_mapping() at mmap time. Either that
>> or allowing MIXEDMAP.
>>
>>>> Also worth noting is the comment in  ttm_bo_mmap_vma_setup() with
>>>> possible performance implications with x86 + PAT + VM_PFNMAP + normal
>>>> pages. That's a very old comment, though, and might not be valid anymore.
>>> I think that's why ttm has a page cache for these, because it indeed
>>> sucks. The PAT changes on pages are rather expensive.
>> IIRC the page cache was implemented because of the slowness of the
>> caching mode transition itself, more specifically the wbinvd() call +
>> global TLB flush.

Yes, exactly that. The global TLB flush is what really breaks our neck 
here from a performance perspective.

>>> There is still an issue for iomem mappings, because the PAT validation
>>> does a linear walk of the resource tree (lol) for every vm_insert_pfn.
>>> But for i915 at least this is fixed by using the io_mapping
>>> infrastructure, which does the PAT reservation only once when you set
>>> up the mapping area at driver load.
>> Yes, I guess that was the issue that the comment describes, but the
>> issue wasn't there with vm_insert_mixed() + VM_MIXEDMAP.
>>
>>> Also TTM uses VM_PFNMAP right now for everything, so it can't be a
>>> problem that hurts much :-)
>> Hmm, both 5.11 and drm-tip appears to still use MIXEDMAP?
>>
>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/ttm/ttm_bo_vm.c#L554
> Uh that's bad, because mixed maps pointing at struct page wont stop
> gup. At least afaik.

Hui? I'm pretty sure MIXEDMAP stops gup as well. Otherwise we would have 
already seen tons of problems with the page cache.

Regards,
Christian.

> Christian, do we need to patch this up, and maybe fix up ttm fault
> handler to use io_mapping so the vm_insert_pfn stuff is fast?
> -Daniel
Christian König Feb. 25, 2021, 10:30 a.m. UTC | #6
On 24.02.21 at 19:46, Jason Gunthorpe wrote:
> On Wed, Feb 24, 2021 at 09:45:51AM +0100, Daniel Vetter wrote:
>
>> Hm I figured everyone just uses MAP_SHARED for buffer objects since
>> COW really makes absolutely no sense. How would we enforce this?
> In RDMA we test
>
> drivers/infiniband/core/ib_core_uverbs.c:       if (!(vma->vm_flags & VM_SHARED))
>
> During mmap to reject use of MAP_PRIVATE on BAR pages.

That's a really good idea. MAP_PRIVATE combined with driver mappings
doesn't really work at all.

Christian.

>
> Jason
Daniel Vetter Feb. 25, 2021, 10:44 a.m. UTC | #7
On Thu, Feb 25, 2021 at 11:28:31AM +0100, Christian König wrote:
> Am 24.02.21 um 10:31 schrieb Daniel Vetter:
> > On Wed, Feb 24, 2021 at 10:16 AM Thomas Hellström (Intel)
> > <thomas_os@shipmail.org> wrote:
> > > 
> > > On 2/24/21 9:45 AM, Daniel Vetter wrote:
> > > > On Wed, Feb 24, 2021 at 8:46 AM Thomas Hellström (Intel)
> > > > <thomas_os@shipmail.org> wrote:
> > > > > On 2/23/21 11:59 AM, Daniel Vetter wrote:
> > > > > > tldr; DMA buffers aren't normal memory, expecting that you can use
> > > > > > them like that (like calling get_user_pages works, or that they're
> > > > > > accounting like any other normal memory) cannot be guaranteed.
> > > > > > 
> > > > > > Since some userspace only runs on integrated devices, where all
> > > > > > buffers are actually all resident system memory, there's a huge
> > > > > > temptation to assume that a struct page is always present and useable
> > > > > > like for any more pagecache backed mmap. This has the potential to
> > > > > > result in a uapi nightmare.
> > > > > > 
> > > > > > To stop this gap require that DMA buffer mmaps are VM_PFNMAP, which
> > > > > > blocks get_user_pages and all the other struct page based
> > > > > > infrastructure for everyone. In spirit this is the uapi counterpart to
> > > > > > the kernel-internal CONFIG_DMABUF_DEBUG.
> > > > > > 
> > > > > > Motivated by a recent patch which wanted to swich the system dma-buf
> > > > > > heap to vm_insert_page instead of vm_insert_pfn.
> > > > > > 
> > > > > > v2:
> > > > > > 
> > > > > > Jason brought up that we also want to guarantee that all ptes have the
> > > > > > pte_special flag set, to catch fast get_user_pages (on architectures
> > > > > > that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would
> > > > > > still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
> > > > > > 
> > > > > >    From auditing the various functions to insert pfn pte entires
> > > > > > (vm_insert_pfn_prot, remap_pfn_range and all it's callers like
> > > > > > dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so
> > > > > > this should be the correct flag to check for.
> > > > > > 
> > > > > If we require VM_PFNMAP, for ordinary page mappings, we also need to
> > > > > disallow COW mappings, since it will not work on architectures that
> > > > > don't have CONFIG_ARCH_HAS_PTE_SPECIAL, (see the docs for vm_normal_page()).
> > > > Hm I figured everyone just uses MAP_SHARED for buffer objects since
> > > > COW really makes absolutely no sense. How would we enforce this?
> > > Perhaps returning -EINVAL on is_cow_mapping() at mmap time. Either that
> > > or allowing MIXEDMAP.
> > > 
> > > > > Also worth noting is the comment in  ttm_bo_mmap_vma_setup() with
> > > > > possible performance implications with x86 + PAT + VM_PFNMAP + normal
> > > > > pages. That's a very old comment, though, and might not be valid anymore.
> > > > I think that's why ttm has a page cache for these, because it indeed
> > > > sucks. The PAT changes on pages are rather expensive.
> > > IIRC the page cache was implemented because of the slowness of the
> > > caching mode transition itself, more specifically the wbinvd() call +
> > > global TLB flush.
> 
> Yes, exactly that. The global TLB flush is what really breaks our neck here
> from a performance perspective.
> 
> > > > There is still an issue for iomem mappings, because the PAT validation
> > > > does a linear walk of the resource tree (lol) for every vm_insert_pfn.
> > > > But for i915 at least this is fixed by using the io_mapping
> > > > infrastructure, which does the PAT reservation only once when you set
> > > > up the mapping area at driver load.
> > > Yes, I guess that was the issue that the comment describes, but the
> > > issue wasn't there with vm_insert_mixed() + VM_MIXEDMAP.
> > > 
> > > > Also TTM uses VM_PFNMAP right now for everything, so it can't be a
> > > > problem that hurts much :-)
> > > Hmm, both 5.11 and drm-tip appears to still use MIXEDMAP?
> > > 
> > > https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/ttm/ttm_bo_vm.c#L554
> > Uh that's bad, because mixed maps pointing at struct page wont stop
> > gup. At least afaik.
> 
> Hui? I'm pretty sure MIXEDMAP stops gup as well. Otherwise we would have
> already seen tons of problems with the page cache.

On any architecture which has CONFIG_ARCH_HAS_PTE_SPECIAL vm_insert_mixed
boils down to vm_insert_pfn wrt gup. And special pte stops gup fast path.

But if you don't have VM_IO or VM_PFNMAP set, then I'm not seeing how
you're stopping gup slow path. See check_vma_flags() in mm/gup.c.
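
For reference, the slow-path check being referred to is roughly the following
(paraphrased as a hypothetical helper; the real code lives in
check_vma_flags() in mm/gup.c and details vary by kernel version):

    #include <linux/mm.h>

    /* Paraphrased from check_vma_flags(): a vma marked VM_IO or VM_PFNMAP
     * is rejected outright, which is what keeps the gup slow path away
     * from driver mappings. A plain VM_MIXEDMAP vma does not trip this
     * check.
     */
    static bool vma_refuses_gup(struct vm_area_struct *vma)
    {
            return vma->vm_flags & (VM_IO | VM_PFNMAP);
    }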

Also if you don't have CONFIG_ARCH_HAS_PTE_SPECIAL then I don't think
vm_insert_mixed even works on iomem pfns. There's the devmap exception,
but we're not devmap. Worse, ttm abuses some accidental codepath to smuggle
in hugepte support by intentionally not being devmap.

So I'm really not sure this works as we think it should. It might be good
to write a quick test program on amdgpu with a buffer in system memory only
and try to do direct io into it (rough sketch below). If it works, you have
a problem, and a bad one.
-Daniel

> 
> Regards,
> Christian.
> 
> > Christian, do we need to patch this up, and maybe fix up ttm fault
> > handler to use io_mapping so the vm_insert_pfn stuff is fast?
> > -Daniel
>
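
A rough user-space sketch of the test proposed above, using the system
dma-buf heap as a stand-in for an amdgpu buffer in system memory (paths,
sizes and error handling are assumptions, not a tested program). O_DIRECT
forces the kernel to gup the destination pages, so a successful read means
gup got through:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <linux/dma-heap.h>

    int main(void)
    {
            struct dma_heap_allocation_data alloc = {
                    .len = 4096,
                    .fd_flags = O_RDWR | O_CLOEXEC,
            };
            int heap = open("/dev/dma_heap/system", O_RDWR);
            ioctl(heap, DMA_HEAP_IOCTL_ALLOC, &alloc);

            void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
                             alloc.fd, 0);

            /* "testfile" must exist and be at least 4096 bytes long. */
            int file = open("testfile", O_RDONLY | O_DIRECT);
            ssize_t ret = pread(file, buf, 4096, 0);

            printf("direct read into dma-buf mmap: %zd (%s)\n", ret,
                   ret < 0 ? strerror(errno) : "gup succeeded, bad!");
            return 0;
    }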
Daniel Vetter Feb. 25, 2021, 10:45 a.m. UTC | #8
On Thu, Feb 25, 2021 at 11:30:23AM +0100, Christian König wrote:
> 
> 
> Am 24.02.21 um 19:46 schrieb Jason Gunthorpe:
> > On Wed, Feb 24, 2021 at 09:45:51AM +0100, Daniel Vetter wrote:
> > 
> > > Hm I figured everyone just uses MAP_SHARED for buffer objects since
> > > COW really makes absolutely no sense. How would we enforce this?
> > In RDMA we test
> > 
> > drivers/infiniband/core/ib_core_uverbs.c:       if (!(vma->vm_flags & VM_SHARED))
> > 
> > During mmap to reject use of MAP_PRIVATE on BAR pages.
> 
> That's a really good idea. MAP_PRIVATE and any driver mappings doesn't
> really work at all.

Yeah I feel like this is the next patch we need to add to this little
series of locking down dma-buf mmap semantics. Probably should also push
these into drm gem mmap code (and maybe ttm can switch over to that, it's
really the same).

One at a time.
-Daniel
> 
> Christian.
> 
> > 
> > Jason
>
Daniel Vetter Feb. 25, 2021, 3:49 p.m. UTC | #9
On Thu, Feb 25, 2021 at 11:44 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Thu, Feb 25, 2021 at 11:28:31AM +0100, Christian König wrote:
> > Am 24.02.21 um 10:31 schrieb Daniel Vetter:
> > > On Wed, Feb 24, 2021 at 10:16 AM Thomas Hellström (Intel)
> > > <thomas_os@shipmail.org> wrote:
> > > >
> > > > On 2/24/21 9:45 AM, Daniel Vetter wrote:
> > > > > On Wed, Feb 24, 2021 at 8:46 AM Thomas Hellström (Intel)
> > > > > <thomas_os@shipmail.org> wrote:
> > > > > > On 2/23/21 11:59 AM, Daniel Vetter wrote:
> > > > > > > tldr; DMA buffers aren't normal memory, expecting that you can use
> > > > > > > them like that (like calling get_user_pages works, or that they're
> > > > > > > accounting like any other normal memory) cannot be guaranteed.
> > > > > > >
> > > > > > > Since some userspace only runs on integrated devices, where all
> > > > > > > buffers are actually all resident system memory, there's a huge
> > > > > > > temptation to assume that a struct page is always present and useable
> > > > > > > like for any more pagecache backed mmap. This has the potential to
> > > > > > > result in a uapi nightmare.
> > > > > > >
> > > > > > > To stop this gap require that DMA buffer mmaps are VM_PFNMAP, which
> > > > > > > blocks get_user_pages and all the other struct page based
> > > > > > > infrastructure for everyone. In spirit this is the uapi counterpart to
> > > > > > > the kernel-internal CONFIG_DMABUF_DEBUG.
> > > > > > >
> > > > > > > Motivated by a recent patch which wanted to swich the system dma-buf
> > > > > > > heap to vm_insert_page instead of vm_insert_pfn.
> > > > > > >
> > > > > > > v2:
> > > > > > >
> > > > > > > Jason brought up that we also want to guarantee that all ptes have the
> > > > > > > pte_special flag set, to catch fast get_user_pages (on architectures
> > > > > > > that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would
> > > > > > > still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
> > > > > > >
> > > > > > >    From auditing the various functions to insert pfn pte entires
> > > > > > > (vm_insert_pfn_prot, remap_pfn_range and all it's callers like
> > > > > > > dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so
> > > > > > > this should be the correct flag to check for.
> > > > > > >
> > > > > > If we require VM_PFNMAP, for ordinary page mappings, we also need to
> > > > > > disallow COW mappings, since it will not work on architectures that
> > > > > > don't have CONFIG_ARCH_HAS_PTE_SPECIAL, (see the docs for vm_normal_page()).
> > > > > Hm I figured everyone just uses MAP_SHARED for buffer objects since
> > > > > COW really makes absolutely no sense. How would we enforce this?
> > > > Perhaps returning -EINVAL on is_cow_mapping() at mmap time. Either that
> > > > or allowing MIXEDMAP.
> > > >
> > > > > > Also worth noting is the comment in  ttm_bo_mmap_vma_setup() with
> > > > > > possible performance implications with x86 + PAT + VM_PFNMAP + normal
> > > > > > pages. That's a very old comment, though, and might not be valid anymore.
> > > > > I think that's why ttm has a page cache for these, because it indeed
> > > > > sucks. The PAT changes on pages are rather expensive.
> > > > IIRC the page cache was implemented because of the slowness of the
> > > > caching mode transition itself, more specifically the wbinvd() call +
> > > > global TLB flush.
> >
> > Yes, exactly that. The global TLB flush is what really breaks our neck here
> > from a performance perspective.
> >
> > > > > There is still an issue for iomem mappings, because the PAT validation
> > > > > does a linear walk of the resource tree (lol) for every vm_insert_pfn.
> > > > > But for i915 at least this is fixed by using the io_mapping
> > > > > infrastructure, which does the PAT reservation only once when you set
> > > > > up the mapping area at driver load.
> > > > Yes, I guess that was the issue that the comment describes, but the
> > > > issue wasn't there with vm_insert_mixed() + VM_MIXEDMAP.
> > > >
> > > > > Also TTM uses VM_PFNMAP right now for everything, so it can't be a
> > > > > problem that hurts much :-)
> > > > Hmm, both 5.11 and drm-tip appears to still use MIXEDMAP?
> > > >
> > > > https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/ttm/ttm_bo_vm.c#L554
> > > Uh that's bad, because mixed maps pointing at struct page wont stop
> > > gup. At least afaik.
> >
> > Hui? I'm pretty sure MIXEDMAP stops gup as well. Otherwise we would have
> > already seen tons of problems with the page cache.
>
> On any architecture which has CONFIG_ARCH_HAS_PTE_SPECIAL vm_insert_mixed
> boils down to vm_insert_pfn wrt gup. And special pte stops gup fast path.
>
> But if you don't have VM_IO or VM_PFNMAP set, then I'm not seeing how
> you're stopping gup slow path. See check_vma_flags() in mm/gup.c.
>
> Also if you don't have CONFIG_ARCH_HAS_PTE_SPECIAL then I don't think
> vm_insert_mixed even works on iomem pfns. There's the devmap exception,
> but we're not devmap. Worse ttm abuses some accidental codepath to smuggle
> in hugepte support by intentionally not being devmap.
>
> So I'm really not sure this works as we think it should. Maybe good to do
> a quick test program on amdgpu with a buffer in system memory only and try
> to do direct io into it. If it works, you have a problem, and a bad one.

That's probably impossible, since a quick git grep shows that pretty
much anything reasonable has special ptes: arc, arm, arm64, powerpc,
riscv, s390, sh, sparc, x86. I don't think you'll have a platform
where you can plug an amdgpu in and actually exercise the bug :-)

So maybe we should just switch over to VM_PFNMAP for ttm for more clarity?
-Daniel


>
> >
> > Regards,
> > Christian.
> >
> > > Christian, do we need to patch this up, and maybe fix up ttm fault
> > > handler to use io_mapping so the vm_insert_pfn stuff is fast?
> > > -Daniel
> >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Christian König Feb. 25, 2021, 4:53 p.m. UTC | #10
On 25.02.21 at 16:49, Daniel Vetter wrote:
> On Thu, Feb 25, 2021 at 11:44 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>> On Thu, Feb 25, 2021 at 11:28:31AM +0100, Christian König wrote:
>>> Am 24.02.21 um 10:31 schrieb Daniel Vetter:
>>>> On Wed, Feb 24, 2021 at 10:16 AM Thomas Hellström (Intel)
>>>> <thomas_os@shipmail.org> wrote:
>>>>> On 2/24/21 9:45 AM, Daniel Vetter wrote:
>>>>>> On Wed, Feb 24, 2021 at 8:46 AM Thomas Hellström (Intel)
>>>>>> <thomas_os@shipmail.org> wrote:
>>>>>>> On 2/23/21 11:59 AM, Daniel Vetter wrote:
>>>>>>>> tldr; DMA buffers aren't normal memory, expecting that you can use
>>>>>>>> them like that (like calling get_user_pages works, or that they're
>>>>>>>> accounting like any other normal memory) cannot be guaranteed.
>>>>>>>>
>>>>>>>> Since some userspace only runs on integrated devices, where all
>>>>>>>> buffers are actually all resident system memory, there's a huge
>>>>>>>> temptation to assume that a struct page is always present and useable
>>>>>>>> like for any more pagecache backed mmap. This has the potential to
>>>>>>>> result in a uapi nightmare.
>>>>>>>>
>>>>>>>> To stop this gap require that DMA buffer mmaps are VM_PFNMAP, which
>>>>>>>> blocks get_user_pages and all the other struct page based
>>>>>>>> infrastructure for everyone. In spirit this is the uapi counterpart to
>>>>>>>> the kernel-internal CONFIG_DMABUF_DEBUG.
>>>>>>>>
>>>>>>>> Motivated by a recent patch which wanted to swich the system dma-buf
>>>>>>>> heap to vm_insert_page instead of vm_insert_pfn.
>>>>>>>>
>>>>>>>> v2:
>>>>>>>>
>>>>>>>> Jason brought up that we also want to guarantee that all ptes have the
>>>>>>>> pte_special flag set, to catch fast get_user_pages (on architectures
>>>>>>>> that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would
>>>>>>>> still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
>>>>>>>>
>>>>>>>>     From auditing the various functions to insert pfn pte entires
>>>>>>>> (vm_insert_pfn_prot, remap_pfn_range and all it's callers like
>>>>>>>> dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so
>>>>>>>> this should be the correct flag to check for.
>>>>>>>>
>>>>>>> If we require VM_PFNMAP, for ordinary page mappings, we also need to
>>>>>>> disallow COW mappings, since it will not work on architectures that
>>>>>>> don't have CONFIG_ARCH_HAS_PTE_SPECIAL, (see the docs for vm_normal_page()).
>>>>>> Hm I figured everyone just uses MAP_SHARED for buffer objects since
>>>>>> COW really makes absolutely no sense. How would we enforce this?
>>>>> Perhaps returning -EINVAL on is_cow_mapping() at mmap time. Either that
>>>>> or allowing MIXEDMAP.
>>>>>
>>>>>>> Also worth noting is the comment in  ttm_bo_mmap_vma_setup() with
>>>>>>> possible performance implications with x86 + PAT + VM_PFNMAP + normal
>>>>>>> pages. That's a very old comment, though, and might not be valid anymore.
>>>>>> I think that's why ttm has a page cache for these, because it indeed
>>>>>> sucks. The PAT changes on pages are rather expensive.
>>>>> IIRC the page cache was implemented because of the slowness of the
>>>>> caching mode transition itself, more specifically the wbinvd() call +
>>>>> global TLB flush.
>>> Yes, exactly that. The global TLB flush is what really breaks our neck here
>>> from a performance perspective.
>>>
>>>>>> There is still an issue for iomem mappings, because the PAT validation
>>>>>> does a linear walk of the resource tree (lol) for every vm_insert_pfn.
>>>>>> But for i915 at least this is fixed by using the io_mapping
>>>>>> infrastructure, which does the PAT reservation only once when you set
>>>>>> up the mapping area at driver load.
>>>>> Yes, I guess that was the issue that the comment describes, but the
>>>>> issue wasn't there with vm_insert_mixed() + VM_MIXEDMAP.
>>>>>
>>>>>> Also TTM uses VM_PFNMAP right now for everything, so it can't be a
>>>>>> problem that hurts much :-)
>>>>> Hmm, both 5.11 and drm-tip appears to still use MIXEDMAP?
>>>>>
>>>>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/ttm/ttm_bo_vm.c#L554
>>>> Uh that's bad, because mixed maps pointing at struct page wont stop
>>>> gup. At least afaik.
>>> Hui? I'm pretty sure MIXEDMAP stops gup as well. Otherwise we would have
>>> already seen tons of problems with the page cache.
>> On any architecture which has CONFIG_ARCH_HAS_PTE_SPECIAL vm_insert_mixed
>> boils down to vm_insert_pfn wrt gup. And special pte stops gup fast path.
>>
>> But if you don't have VM_IO or VM_PFNMAP set, then I'm not seeing how
>> you're stopping gup slow path. See check_vma_flags() in mm/gup.c.
>>
>> Also if you don't have CONFIG_ARCH_HAS_PTE_SPECIAL then I don't think
>> vm_insert_mixed even works on iomem pfns. There's the devmap exception,
>> but we're not devmap. Worse ttm abuses some accidental codepath to smuggle
>> in hugepte support by intentionally not being devmap.
>>
>> So I'm really not sure this works as we think it should. Maybe good to do
>> a quick test program on amdgpu with a buffer in system memory only and try
>> to do direct io into it. If it works, you have a problem, and a bad one.
> That's probably impossible, since a quick git grep shows that pretty
> much anything reasonable has special ptes: arc, arm, arm64, powerpc,
> riscv, s390, sh, sparc, x86. I don't think you'll have a platform
> where you can plug an amdgpu in and actually exercise the bug :-)
>
> So maybe we should just switch over to VM_PFNMAP for ttm for more clarity?

Maybe yes, but not sure.

I once had a request to do this from some Google guys, but rejected
it because I wasn't sure of the consequences.

Christian.

> -Daniel
>
>
>>> Regards,
>>> Christian.
>>>
>>>> Christian, do we need to patch this up, and maybe fix up ttm fault
>>>> handler to use io_mapping so the vm_insert_pfn stuff is fast?
>>>> -Daniel
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch
>
>
John Stultz Feb. 26, 2021, 3:57 a.m. UTC | #11
On Tue, Feb 23, 2021 at 3:00 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> tldr; DMA buffers aren't normal memory, expecting that you can use
> them like that (like calling get_user_pages works, or that they're
> accounting like any other normal memory) cannot be guaranteed.
>
> Since some userspace only runs on integrated devices, where all
> buffers are actually all resident system memory, there's a huge
> temptation to assume that a struct page is always present and useable
> like for any more pagecache backed mmap. This has the potential to
> result in a uapi nightmare.
>
> To stop this gap require that DMA buffer mmaps are VM_PFNMAP, which
> blocks get_user_pages and all the other struct page based
> infrastructure for everyone. In spirit this is the uapi counterpart to
> the kernel-internal CONFIG_DMABUF_DEBUG.
>
> Motivated by a recent patch which wanted to swich the system dma-buf
> heap to vm_insert_page instead of vm_insert_pfn.
>
> v2:
>
> Jason brought up that we also want to guarantee that all ptes have the
> pte_special flag set, to catch fast get_user_pages (on architectures
> that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would
> still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
>
> From auditing the various functions to insert pfn pte entires
> (vm_insert_pfn_prot, remap_pfn_range and all it's callers like
> dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so
> this should be the correct flag to check for.
>
> References: https://lore.kernel.org/lkml/CAKMK7uHi+mG0z0HUmNt13QCCvutuRVjpcR0NjRL12k-WbWzkRg@mail.gmail.com/
> Acked-by: Christian König <christian.koenig@amd.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: John Stultz <john.stultz@linaro.org>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> Cc: Sumit Semwal <sumit.semwal@linaro.org>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: linux-media@vger.kernel.org
> Cc: linaro-mm-sig@lists.linaro.org
> ---
>  drivers/dma-buf/dma-buf.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)


So I gave this a spin in a few of my environments, and with the
current dmabuf heaps it spews a lot of warnings.

I'm testing some simple fixes to add:
    vma->vm_flags |= VM_PFNMAP;

to the dmabuf heap mmap ops, which we might want to queue alongside this.

So assuming those can land together.
Acked-by: John Stultz <john.stultz@linaro.org>

thanks
-john
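
For context, the kind of fix John mentions is a one-line flag update in each
heap's mmap callback. A sketch (structure and names assumed, not the actual
queued patch); note this only makes sense while the heap inserts pfns (e.g.
via remap_pfn_range() or vmf_insert_pfn()) rather than vm_insert_page(),
which refuses VM_PFNMAP vmas:

    /* Sketch only: mark dma-buf heap mappings as pfn maps so that the
     * VM_PFNMAP check added by the patch above is satisfied and gup
     * stays away from heap buffers.
     */
    static int example_heap_mmap(struct dma_buf *dmabuf,
                                 struct vm_area_struct *vma)
    {
            vma->vm_flags |= VM_PFNMAP;

            /* ... existing per-page remap_pfn_range()/vmf_insert_pfn()
             * loop over the buffer follows here ...
             */
            return 0;
    }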
Thomas Hellström (Intel) Feb. 26, 2021, 9:41 a.m. UTC | #12
On 2/25/21 4:49 PM, Daniel Vetter wrote:
> On Thu, Feb 25, 2021 at 11:44 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>> On Thu, Feb 25, 2021 at 11:28:31AM +0100, Christian König wrote:
>>> Am 24.02.21 um 10:31 schrieb Daniel Vetter:
>>>> On Wed, Feb 24, 2021 at 10:16 AM Thomas Hellström (Intel)
>>>> <thomas_os@shipmail.org> wrote:
>>>>> On 2/24/21 9:45 AM, Daniel Vetter wrote:
>>>>>> On Wed, Feb 24, 2021 at 8:46 AM Thomas Hellström (Intel)
>>>>>> <thomas_os@shipmail.org> wrote:
>>>>>>> On 2/23/21 11:59 AM, Daniel Vetter wrote:
>>>>>>>> tldr; DMA buffers aren't normal memory, expecting that you can use
>>>>>>>> them like that (like calling get_user_pages works, or that they're
>>>>>>>> accounting like any other normal memory) cannot be guaranteed.
>>>>>>>>
>>>>>>>> Since some userspace only runs on integrated devices, where all
>>>>>>>> buffers are actually all resident system memory, there's a huge
>>>>>>>> temptation to assume that a struct page is always present and useable
>>>>>>>> like for any more pagecache backed mmap. This has the potential to
>>>>>>>> result in a uapi nightmare.
>>>>>>>>
>>>>>>>> To stop this gap require that DMA buffer mmaps are VM_PFNMAP, which
>>>>>>>> blocks get_user_pages and all the other struct page based
>>>>>>>> infrastructure for everyone. In spirit this is the uapi counterpart to
>>>>>>>> the kernel-internal CONFIG_DMABUF_DEBUG.
>>>>>>>>
>>>>>>>> Motivated by a recent patch which wanted to swich the system dma-buf
>>>>>>>> heap to vm_insert_page instead of vm_insert_pfn.
>>>>>>>>
>>>>>>>> v2:
>>>>>>>>
>>>>>>>> Jason brought up that we also want to guarantee that all ptes have the
>>>>>>>> pte_special flag set, to catch fast get_user_pages (on architectures
>>>>>>>> that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would
>>>>>>>> still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
>>>>>>>>
>>>>>>>>     From auditing the various functions to insert pfn pte entires
>>>>>>>> (vm_insert_pfn_prot, remap_pfn_range and all it's callers like
>>>>>>>> dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so
>>>>>>>> this should be the correct flag to check for.
>>>>>>>>
>>>>>>> If we require VM_PFNMAP, for ordinary page mappings, we also need to
>>>>>>> disallow COW mappings, since it will not work on architectures that
>>>>>>> don't have CONFIG_ARCH_HAS_PTE_SPECIAL, (see the docs for vm_normal_page()).
>>>>>> Hm I figured everyone just uses MAP_SHARED for buffer objects since
>>>>>> COW really makes absolutely no sense. How would we enforce this?
>>>>> Perhaps returning -EINVAL on is_cow_mapping() at mmap time. Either that
>>>>> or allowing MIXEDMAP.
>>>>>
>>>>>>> Also worth noting is the comment in  ttm_bo_mmap_vma_setup() with
>>>>>>> possible performance implications with x86 + PAT + VM_PFNMAP + normal
>>>>>>> pages. That's a very old comment, though, and might not be valid anymore.
>>>>>> I think that's why ttm has a page cache for these, because it indeed
>>>>>> sucks. The PAT changes on pages are rather expensive.
>>>>> IIRC the page cache was implemented because of the slowness of the
>>>>> caching mode transition itself, more specifically the wbinvd() call +
>>>>> global TLB flush.
>>> Yes, exactly that. The global TLB flush is what really breaks our neck here
>>> from a performance perspective.
>>>
>>>>>> There is still an issue for iomem mappings, because the PAT validation
>>>>>> does a linear walk of the resource tree (lol) for every vm_insert_pfn.
>>>>>> But for i915 at least this is fixed by using the io_mapping
>>>>>> infrastructure, which does the PAT reservation only once when you set
>>>>>> up the mapping area at driver load.
>>>>> Yes, I guess that was the issue that the comment describes, but the
>>>>> issue wasn't there with vm_insert_mixed() + VM_MIXEDMAP.
>>>>>
>>>>>> Also TTM uses VM_PFNMAP right now for everything, so it can't be a
>>>>>> problem that hurts much :-)
>>>>> Hmm, both 5.11 and drm-tip appears to still use MIXEDMAP?
>>>>>
>>>>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/ttm/ttm_bo_vm.c#L554
>>>> Uh that's bad, because mixed maps pointing at struct page wont stop
>>>> gup. At least afaik.
>>> Hui? I'm pretty sure MIXEDMAP stops gup as well. Otherwise we would have
>>> already seen tons of problems with the page cache.
>> On any architecture which has CONFIG_ARCH_HAS_PTE_SPECIAL vm_insert_mixed
>> boils down to vm_insert_pfn wrt gup. And special pte stops gup fast path.
>>
>> But if you don't have VM_IO or VM_PFNMAP set, then I'm not seeing how
>> you're stopping gup slow path. See check_vma_flags() in mm/gup.c.
>>
>> Also if you don't have CONFIG_ARCH_HAS_PTE_SPECIAL then I don't think
>> vm_insert_mixed even works on iomem pfns. There's the devmap exception,
>> but we're not devmap. Worse ttm abuses some accidental codepath to smuggle
>> in hugepte support by intentionally not being devmap.
>>
>> So I'm really not sure this works as we think it should. Maybe good to do
>> a quick test program on amdgpu with a buffer in system memory only and try
>> to do direct io into it. If it works, you have a problem, and a bad one.
> That's probably impossible, since a quick git grep shows that pretty
> much anything reasonable has special ptes: arc, arm, arm64, powerpc,
> riscv, s390, sh, sparc, x86. I don't think you'll have a platform
> where you can plug an amdgpu in and actually exercise the bug :-)

Hm. AFAIK _insert_mixed() doesn't set PTE_SPECIAL on system pages, so I 
don't see what should be stopping gup to those?

/Thomas



>
> So maybe we should just switch over to VM_PFNMAP for ttm for more clarity?
> -Daniel
>
>
>>> Regards,
>>> Christian.
>>>
>>>> Christian, do we need to patch this up, and maybe fix up ttm fault
>>>> handler to use io_mapping so the vm_insert_pfn stuff is fast?
>>>> -Daniel
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch
>
>
Daniel Vetter Feb. 26, 2021, 1:28 p.m. UTC | #13
On Fri, Feb 26, 2021 at 10:41 AM Thomas Hellström (Intel)
<thomas_os@shipmail.org> wrote:
>
>
> On 2/25/21 4:49 PM, Daniel Vetter wrote:
> > On Thu, Feb 25, 2021 at 11:44 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >> On Thu, Feb 25, 2021 at 11:28:31AM +0100, Christian König wrote:
> >>> Am 24.02.21 um 10:31 schrieb Daniel Vetter:
> >>>> On Wed, Feb 24, 2021 at 10:16 AM Thomas Hellström (Intel)
> >>>> <thomas_os@shipmail.org> wrote:
> >>>>> On 2/24/21 9:45 AM, Daniel Vetter wrote:
> >>>>>> On Wed, Feb 24, 2021 at 8:46 AM Thomas Hellström (Intel)
> >>>>>> <thomas_os@shipmail.org> wrote:
> >>>>>>> On 2/23/21 11:59 AM, Daniel Vetter wrote:
> >>>>>>>> tldr; DMA buffers aren't normal memory, expecting that you can use
> >>>>>>>> them like that (like calling get_user_pages works, or that they're
> >>>>>>>> accounting like any other normal memory) cannot be guaranteed.
> >>>>>>>>
> >>>>>>>> Since some userspace only runs on integrated devices, where all
> >>>>>>>> buffers are actually all resident system memory, there's a huge
> >>>>>>>> temptation to assume that a struct page is always present and useable
> >>>>>>>> like for any more pagecache backed mmap. This has the potential to
> >>>>>>>> result in a uapi nightmare.
> >>>>>>>>
> >>>>>>>> To stop this gap require that DMA buffer mmaps are VM_PFNMAP, which
> >>>>>>>> blocks get_user_pages and all the other struct page based
> >>>>>>>> infrastructure for everyone. In spirit this is the uapi counterpart to
> >>>>>>>> the kernel-internal CONFIG_DMABUF_DEBUG.
> >>>>>>>>
> >>>>>>>> Motivated by a recent patch which wanted to swich the system dma-buf
> >>>>>>>> heap to vm_insert_page instead of vm_insert_pfn.
> >>>>>>>>
> >>>>>>>> v2:
> >>>>>>>>
> >>>>>>>> Jason brought up that we also want to guarantee that all ptes have the
> >>>>>>>> pte_special flag set, to catch fast get_user_pages (on architectures
> >>>>>>>> that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would
> >>>>>>>> still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
> >>>>>>>>
> >>>>>>>>     From auditing the various functions to insert pfn pte entires
> >>>>>>>> (vm_insert_pfn_prot, remap_pfn_range and all it's callers like
> >>>>>>>> dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so
> >>>>>>>> this should be the correct flag to check for.
> >>>>>>>>
> >>>>>>> If we require VM_PFNMAP, for ordinary page mappings, we also need to
> >>>>>>> disallow COW mappings, since it will not work on architectures that
> >>>>>>> don't have CONFIG_ARCH_HAS_PTE_SPECIAL, (see the docs for vm_normal_page()).
> >>>>>> Hm I figured everyone just uses MAP_SHARED for buffer objects since
> >>>>>> COW really makes absolutely no sense. How would we enforce this?
> >>>>> Perhaps returning -EINVAL on is_cow_mapping() at mmap time. Either that
> >>>>> or allowing MIXEDMAP.
> >>>>>
> >>>>>>> Also worth noting is the comment in  ttm_bo_mmap_vma_setup() with
> >>>>>>> possible performance implications with x86 + PAT + VM_PFNMAP + normal
> >>>>>>> pages. That's a very old comment, though, and might not be valid anymore.
> >>>>>> I think that's why ttm has a page cache for these, because it indeed
> >>>>>> sucks. The PAT changes on pages are rather expensive.
> >>>>> IIRC the page cache was implemented because of the slowness of the
> >>>>> caching mode transition itself, more specifically the wbinvd() call +
> >>>>> global TLB flush.
> >>> Yes, exactly that. The global TLB flush is what really breaks our neck here
> >>> from a performance perspective.
> >>>
> >>>>>> There is still an issue for iomem mappings, because the PAT validation
> >>>>>> does a linear walk of the resource tree (lol) for every vm_insert_pfn.
> >>>>>> But for i915 at least this is fixed by using the io_mapping
> >>>>>> infrastructure, which does the PAT reservation only once when you set
> >>>>>> up the mapping area at driver load.
> >>>>> Yes, I guess that was the issue that the comment describes, but the
> >>>>> issue wasn't there with vm_insert_mixed() + VM_MIXEDMAP.
> >>>>>
> >>>>>> Also TTM uses VM_PFNMAP right now for everything, so it can't be a
> >>>>>> problem that hurts much :-)
> >>>>> Hmm, both 5.11 and drm-tip appears to still use MIXEDMAP?
> >>>>>
> >>>>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/ttm/ttm_bo_vm.c#L554
> >>>> Uh that's bad, because mixed maps pointing at struct page wont stop
> >>>> gup. At least afaik.
> >>> Hui? I'm pretty sure MIXEDMAP stops gup as well. Otherwise we would have
> >>> already seen tons of problems with the page cache.
> >> On any architecture which has CONFIG_ARCH_HAS_PTE_SPECIAL vm_insert_mixed
> >> boils down to vm_insert_pfn wrt gup. And special pte stops gup fast path.
> >>
> >> But if you don't have VM_IO or VM_PFNMAP set, then I'm not seeing how
> >> you're stopping gup slow path. See check_vma_flags() in mm/gup.c.
> >>
> >> Also if you don't have CONFIG_ARCH_HAS_PTE_SPECIAL then I don't think
> >> vm_insert_mixed even works on iomem pfns. There's the devmap exception,
> >> but we're not devmap. Worse ttm abuses some accidental codepath to smuggle
> >> in hugepte support by intentionally not being devmap.
> >>
> >> So I'm really not sure this works as we think it should. Maybe good to do
> >> a quick test program on amdgpu with a buffer in system memory only and try
> >> to do direct io into it. If it works, you have a problem, and a bad one.
> > That's probably impossible, since a quick git grep shows that pretty
> > much anything reasonable has special ptes: arc, arm, arm64, powerpc,
> > riscv, s390, sh, sparc, x86. I don't think you'll have a platform
> > where you can plug an amdgpu in and actually exercise the bug :-)
>
> Hm. AFAIK _insert_mixed() doesn't set PTE_SPECIAL on system pages, so I
> don't see what should be stopping gup to those?

If you have an arch with pte special we use insert_pfn(), which afaict
will use pte_mkspecial for the !devmap case. And ttm isn't devmap
(otherwise our hugepte abuse of devmap hugeptes would go rather
wrong).
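
For reference, the relevant bit of insert_pfn() in mm/memory.c looks roughly
like this (paraphrased from upstream; exact code differs between kernel
versions):

    /* Paraphrased from insert_pfn(): unless the pfn is devmap, the pte is
     * created with pte_mkspecial(), so the gup fast path refuses it on
     * CONFIG_ARCH_HAS_PTE_SPECIAL architectures.
     */
    if (pfn_t_devmap(pfn))
            entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
    else
            entry = pte_mkspecial(pfn_t_pte(pfn, prot));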

So I think it stops gup. But I haven't verified at all. Would be good
if Christian can check this with some direct io to a buffer in system
memory.
-Daniel
Thomas Hellström (Intel) Feb. 27, 2021, 8:06 a.m. UTC | #14
On 2/26/21 2:28 PM, Daniel Vetter wrote:
> On Fri, Feb 26, 2021 at 10:41 AM Thomas Hellström (Intel)
> <thomas_os@shipmail.org> wrote:
>>
>> On 2/25/21 4:49 PM, Daniel Vetter wrote:
>>> On Thu, Feb 25, 2021 at 11:44 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>>>> On Thu, Feb 25, 2021 at 11:28:31AM +0100, Christian König wrote:
>>>>> Am 24.02.21 um 10:31 schrieb Daniel Vetter:
>>>>>> On Wed, Feb 24, 2021 at 10:16 AM Thomas Hellström (Intel)
>>>>>> <thomas_os@shipmail.org> wrote:
>>>>>>> On 2/24/21 9:45 AM, Daniel Vetter wrote:
>>>>>>>> On Wed, Feb 24, 2021 at 8:46 AM Thomas Hellström (Intel)
>>>>>>>> <thomas_os@shipmail.org> wrote:
>>>>>>>>> On 2/23/21 11:59 AM, Daniel Vetter wrote:
>>>>>>>>>> tldr; DMA buffers aren't normal memory, expecting that you can use
>>>>>>>>>> them like that (like calling get_user_pages works, or that they're
>>>>>>>>>> accounting like any other normal memory) cannot be guaranteed.
>>>>>>>>>>
>>>>>>>>>> Since some userspace only runs on integrated devices, where all
>>>>>>>>>> buffers are actually all resident system memory, there's a huge
>>>>>>>>>> temptation to assume that a struct page is always present and useable
>>>>>>>>>> like for any more pagecache backed mmap. This has the potential to
>>>>>>>>>> result in a uapi nightmare.
>>>>>>>>>>
>>>>>>>>>> To stop this gap require that DMA buffer mmaps are VM_PFNMAP, which
>>>>>>>>>> blocks get_user_pages and all the other struct page based
>>>>>>>>>> infrastructure for everyone. In spirit this is the uapi counterpart to
>>>>>>>>>> the kernel-internal CONFIG_DMABUF_DEBUG.
>>>>>>>>>>
>>>>>>>>>> Motivated by a recent patch which wanted to swich the system dma-buf
>>>>>>>>>> heap to vm_insert_page instead of vm_insert_pfn.
>>>>>>>>>>
>>>>>>>>>> v2:
>>>>>>>>>>
>>>>>>>>>> Jason brought up that we also want to guarantee that all ptes have the
>>>>>>>>>> pte_special flag set, to catch fast get_user_pages (on architectures
>>>>>>>>>> that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would
>>>>>>>>>> still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
>>>>>>>>>>
>>>>>>>>>>      From auditing the various functions to insert pfn pte entires
>>>>>>>>>> (vm_insert_pfn_prot, remap_pfn_range and all it's callers like
>>>>>>>>>> dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so
>>>>>>>>>> this should be the correct flag to check for.
>>>>>>>>>>
>>>>>>>>> If we require VM_PFNMAP, for ordinary page mappings, we also need to
>>>>>>>>> disallow COW mappings, since it will not work on architectures that
>>>>>>>>> don't have CONFIG_ARCH_HAS_PTE_SPECIAL, (see the docs for vm_normal_page()).
>>>>>>>> Hm I figured everyone just uses MAP_SHARED for buffer objects since
>>>>>>>> COW really makes absolutely no sense. How would we enforce this?
>>>>>>> Perhaps returning -EINVAL on is_cow_mapping() at mmap time. Either that
>>>>>>> or allowing MIXEDMAP.
>>>>>>>
>>>>>>>>> Also worth noting is the comment in  ttm_bo_mmap_vma_setup() with
>>>>>>>>> possible performance implications with x86 + PAT + VM_PFNMAP + normal
>>>>>>>>> pages. That's a very old comment, though, and might not be valid anymore.
>>>>>>>> I think that's why ttm has a page cache for these, because it indeed
>>>>>>>> sucks. The PAT changes on pages are rather expensive.
>>>>>>> IIRC the page cache was implemented because of the slowness of the
>>>>>>> caching mode transition itself, more specifically the wbinvd() call +
>>>>>>> global TLB flush.
>>>>> Yes, exactly that. The global TLB flush is what really breaks our neck here
>>>>> from a performance perspective.
>>>>>
>>>>>>>> There is still an issue for iomem mappings, because the PAT validation
>>>>>>>> does a linear walk of the resource tree (lol) for every vm_insert_pfn.
>>>>>>>> But for i915 at least this is fixed by using the io_mapping
>>>>>>>> infrastructure, which does the PAT reservation only once when you set
>>>>>>>> up the mapping area at driver load.
>>>>>>> Yes, I guess that was the issue that the comment describes, but the
>>>>>>> issue wasn't there with vm_insert_mixed() + VM_MIXEDMAP.
>>>>>>>
>>>>>>>> Also TTM uses VM_PFNMAP right now for everything, so it can't be a
>>>>>>>> problem that hurts much :-)
>>>>>>> Hmm, both 5.11 and drm-tip appears to still use MIXEDMAP?
>>>>>>>
>>>>>>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/ttm/ttm_bo_vm.c#L554
>>>>>> Uh that's bad, because mixed maps pointing at struct page wont stop
>>>>>> gup. At least afaik.
>>>>> Hui? I'm pretty sure MIXEDMAP stops gup as well. Otherwise we would have
>>>>> already seen tons of problems with the page cache.
>>>> On any architecture which has CONFIG_ARCH_HAS_PTE_SPECIAL vm_insert_mixed
>>>> boils down to vm_insert_pfn wrt gup. And special pte stops gup fast path.
>>>>
>>>> But if you don't have VM_IO or VM_PFNMAP set, then I'm not seeing how
>>>> you're stopping gup slow path. See check_vma_flags() in mm/gup.c.
>>>>
>>>> Also if you don't have CONFIG_ARCH_HAS_PTE_SPECIAL then I don't think
>>>> vm_insert_mixed even works on iomem pfns. There's the devmap exception,
>>>> but we're not devmap. Worse ttm abuses some accidental codepath to smuggle
>>>> in hugepte support by intentionally not being devmap.
>>>>
>>>> So I'm really not sure this works as we think it should. Maybe good to do
>>>> a quick test program on amdgpu with a buffer in system memory only and try
>>>> to do direct io into it. If it works, you have a problem, and a bad one.
>>> That's probably impossible, since a quick git grep shows that pretty
>>> much anything reasonable has special ptes: arc, arm, arm64, powerpc,
>>> riscv, s390, sh, sparc, x86. I don't think you'll have a platform
>>> where you can plug an amdgpu in and actually exercise the bug :-)
>> Hm. AFAIK _insert_mixed() doesn't set PTE_SPECIAL on system pages, so I
>> don't see what should be stopping gup to those?
> If you have an arch with pte special we use insert_pfn(), which afaict
> will use pte_mkspecial for the !devmap case. And ttm isn't devmap
> (otherwise our hugepte abuse of devmap hugeptes would go rather
> wrong).
>
> So I think it stops gup. But I haven't verified at all. Would be good
> if Christian can check this with some direct io to a buffer in system
> memory.

Hmm,

Docs (again, vm_normal_page()) say:

  * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
  * page" backing, however the difference is that _all_ pages with a struct
  * page (that is, those where pfn_valid is true) are refcounted and considered
  * normal pages by the VM. The disadvantage is that pages are refcounted
  * (which can be slower and simply not an option for some PFNMAP users). The
  * advantage is that we don't have to follow the strict linearity rule of
  * PFNMAP mappings in order to support COWable mappings.

but it's true that __vm_insert_mixed() ends up in the insert_pfn() path, so
the above isn't really accurate anymore, which makes me wonder whether, and
if so why, there could still be any significant performance difference
between MIXEDMAP and PFNMAP.

BTW regarding the TTM hugeptes, I don't think we ever landed that devmap 
hack, so they are (for the non-gup case) relying on 
vma_is_special_huge(). For the gup case, I think the bug is still there.
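
For reference, the check vma_is_special_huge() does is roughly the
following (paraphrased from memory, and the function name is suffixed to
make clear this is a sketch, not the exact kernel source):

/* Paraphrased sketch of vma_is_special_huge(); not the exact source. */
static inline bool vma_is_special_huge_sketch(const struct vm_area_struct *vma)
{
	/* DAX vmas, and file-backed vmas with pfn/mixed mappings, get the
	 * "special" huge entry treatment in the non-gup paths. */
	return vma_is_dax(vma) ||
	       (vma->vm_file &&
		(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)));
}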

/Thomas

> -Daniel
Daniel Vetter March 1, 2021, 8:28 a.m. UTC | #15
On Sat, Feb 27, 2021 at 9:06 AM Thomas Hellström (Intel)
<thomas_os@shipmail.org> wrote:
> On 2/26/21 2:28 PM, Daniel Vetter wrote:
> > So I think it stops gup. But I haven't verified at all. Would be good
> > if Christian can check this with some direct io to a buffer in system
> > memory.
>
> Hmm,
>
> Docs (again vm_normal_page() say)
>
>   * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
>   * page" backing, however the difference is that _all_ pages with a struct
>   * page (that is, those where pfn_valid is true) are refcounted and
> considered
>   * normal pages by the VM. The disadvantage is that pages are refcounted
>   * (which can be slower and simply not an option for some PFNMAP
> users). The
>   * advantage is that we don't have to follow the strict linearity rule of
>   * PFNMAP mappings in order to support COWable mappings.
>
> but it's true __vm_insert_mixed() ends up in the insert_pfn() path, so
> the above isn't really true, which makes me wonder if and in that case
> why there could any longer ever be a significant performance difference
> between MIXEDMAP and PFNMAP.

Yeah it's definitely confusing. I guess I'll hack up a patch and see
what sticks.

> BTW regarding the TTM hugeptes, I don't think we ever landed that devmap
> hack, so they are (for the non-gup case) relying on
> vma_is_special_huge(). For the gup case, I think the bug is still there.

Maybe there's another devmap hack, but the ttm_vm_insert functions do
use PFN_DEV and all that. And I think that stops gup_fast from trying
to find the underlying page.
-Daniel
Thomas Hellström (Intel) March 1, 2021, 8:39 a.m. UTC | #16
Hi,

On 3/1/21 9:28 AM, Daniel Vetter wrote:
> On Sat, Feb 27, 2021 at 9:06 AM Thomas Hellström (Intel)
> <thomas_os@shipmail.org> wrote:
>> On 2/26/21 2:28 PM, Daniel Vetter wrote:
>>> So I think it stops gup. But I haven't verified at all. Would be good
>>> if Christian can check this with some direct io to a buffer in system
>>> memory.
>> Hmm,
>>
>> Docs (again vm_normal_page() say)
>>
>>    * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
>>    * page" backing, however the difference is that _all_ pages with a struct
>>    * page (that is, those where pfn_valid is true) are refcounted and
>> considered
>>    * normal pages by the VM. The disadvantage is that pages are refcounted
>>    * (which can be slower and simply not an option for some PFNMAP
>> users). The
>>    * advantage is that we don't have to follow the strict linearity rule of
>>    * PFNMAP mappings in order to support COWable mappings.
>>
>> but it's true __vm_insert_mixed() ends up in the insert_pfn() path, so
>> the above isn't really true, which makes me wonder if and in that case
>> why there could any longer ever be a significant performance difference
>> between MIXEDMAP and PFNMAP.
> Yeah it's definitely confusing. I guess I'll hack up a patch and see
> what sticks.
>
>> BTW regarding the TTM hugeptes, I don't think we ever landed that devmap
>> hack, so they are (for the non-gup case) relying on
>> vma_is_special_huge(). For the gup case, I think the bug is still there.
> Maybe there's another devmap hack, but the ttm_vm_insert functions do
> use PFN_DEV and all that. And I think that stops gup_fast from trying
> to find the underlying page.
> -Daniel

Hmm, perhaps it might, but I don't think so. The fix I tried out was to set
PFN_DEV | PFN_MAP for huge PTEs, which causes pfn_devmap() to be true, and
then follow_devmap_pmd()->get_dev_pagemap() returns NULL and gup_fast()
backs off.

In the end that would mean setting in stone that "if there is a huge
devmap page table entry for which we haven't registered any devmap
struct pages (get_dev_pagemap returns NULL), we should treat that as a
"special" huge page table entry".

From what I can tell, all code calling get_dev_pagemap() already does
that, it's just a question of getting it accepted and formalizing it.
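
To spell out the back-off I mean, a purely illustrative sketch (the
function name and structure are mine, not the actual mm/gup.c code):

/* Illustrative sketch of the fast-gup back-off described above; not the
 * actual mm/gup.c code, just the shape of the logic. */
static int gup_fast_huge_devmap_sketch(unsigned long pfn,
				       struct page **pages, int *nr)
{
	struct dev_pagemap *pgmap;

	/* The entry was inserted with PFN_DEV | PFN_MAP, so pmd_devmap()
	 * is true and fast gup goes looking for the backing devmap. */
	pgmap = get_dev_pagemap(pfn, NULL);
	if (!pgmap) {
		/* No devmap struct pages registered for this pfn: treat
		 * the entry as "special" and bail out of the fast path. */
		return 0;
	}

	pages[(*nr)++] = pfn_to_page(pfn);
	put_dev_pagemap(pgmap);
	return 1;
}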

/Thomas
Daniel Vetter March 1, 2021, 9:05 a.m. UTC | #17
On Mon, Mar 01, 2021 at 09:39:53AM +0100, Thomas Hellström (Intel) wrote:
> Hi,
> 
> On 3/1/21 9:28 AM, Daniel Vetter wrote:
> > On Sat, Feb 27, 2021 at 9:06 AM Thomas Hellström (Intel)
> > <thomas_os@shipmail.org> wrote:
> > > On 2/26/21 2:28 PM, Daniel Vetter wrote:
> > > > So I think it stops gup. But I haven't verified at all. Would be good
> > > > if Christian can check this with some direct io to a buffer in system
> > > > memory.
> > > Hmm,
> > > 
> > > Docs (again vm_normal_page() say)
> > > 
> > >    * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
> > >    * page" backing, however the difference is that _all_ pages with a struct
> > >    * page (that is, those where pfn_valid is true) are refcounted and
> > > considered
> > >    * normal pages by the VM. The disadvantage is that pages are refcounted
> > >    * (which can be slower and simply not an option for some PFNMAP
> > > users). The
> > >    * advantage is that we don't have to follow the strict linearity rule of
> > >    * PFNMAP mappings in order to support COWable mappings.
> > > 
> > > but it's true __vm_insert_mixed() ends up in the insert_pfn() path, so
> > > the above isn't really true, which makes me wonder if and in that case
> > > why there could any longer ever be a significant performance difference
> > > between MIXEDMAP and PFNMAP.
> > Yeah it's definitely confusing. I guess I'll hack up a patch and see
> > what sticks.
> > 
> > > BTW regarding the TTM hugeptes, I don't think we ever landed that devmap
> > > hack, so they are (for the non-gup case) relying on
> > > vma_is_special_huge(). For the gup case, I think the bug is still there.
> > Maybe there's another devmap hack, but the ttm_vm_insert functions do
> > use PFN_DEV and all that. And I think that stops gup_fast from trying
> > to find the underlying page.
> > -Daniel
> 
> Hmm perhaps it might, but I don't think so. The fix I tried out was to set
> 
> PFN_DEV | PFN_MAP for huge PTEs which causes pfn_devmap() to be true, and
> then
> 
> follow_devmap_pmd()->get_dev_pagemap() which returns NULL and gup_fast()
> backs off,
> 
> in the end that would mean setting in stone that "if there is a huge devmap
> page table entry for which we haven't registered any devmap struct pages
> (get_dev_pagemap returns NULL), we should treat that as a "special" huge
> page table entry".
> 
> From what I can tell, all code calling get_dev_pagemap() already does that,
> it's just a question of getting it accepted and formalizing it.

Oh, I thought that's already how it works, since I didn't spot anything
else that would block gup_fast from falling over. I guess we'd really
need some testcases to make sure direct i/o (that's the easiest to test)
fails like we expect.
-Daniel
Thomas Hellström (Intel) March 1, 2021, 9:21 a.m. UTC | #18
On 3/1/21 10:05 AM, Daniel Vetter wrote:
> On Mon, Mar 01, 2021 at 09:39:53AM +0100, Thomas Hellström (Intel) wrote:
>> Hi,
>>
>> On 3/1/21 9:28 AM, Daniel Vetter wrote:
>>> On Sat, Feb 27, 2021 at 9:06 AM Thomas Hellström (Intel)
>>> <thomas_os@shipmail.org> wrote:
>>>> On 2/26/21 2:28 PM, Daniel Vetter wrote:
>>>>> So I think it stops gup. But I haven't verified at all. Would be good
>>>>> if Christian can check this with some direct io to a buffer in system
>>>>> memory.
>>>> Hmm,
>>>>
>>>> Docs (again vm_normal_page() say)
>>>>
>>>>     * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
>>>>     * page" backing, however the difference is that _all_ pages with a struct
>>>>     * page (that is, those where pfn_valid is true) are refcounted and
>>>> considered
>>>>     * normal pages by the VM. The disadvantage is that pages are refcounted
>>>>     * (which can be slower and simply not an option for some PFNMAP
>>>> users). The
>>>>     * advantage is that we don't have to follow the strict linearity rule of
>>>>     * PFNMAP mappings in order to support COWable mappings.
>>>>
>>>> but it's true __vm_insert_mixed() ends up in the insert_pfn() path, so
>>>> the above isn't really true, which makes me wonder if and in that case
>>>> why there could any longer ever be a significant performance difference
>>>> between MIXEDMAP and PFNMAP.
>>> Yeah it's definitely confusing. I guess I'll hack up a patch and see
>>> what sticks.
>>>
>>>> BTW regarding the TTM hugeptes, I don't think we ever landed that devmap
>>>> hack, so they are (for the non-gup case) relying on
>>>> vma_is_special_huge(). For the gup case, I think the bug is still there.
>>> Maybe there's another devmap hack, but the ttm_vm_insert functions do
>>> use PFN_DEV and all that. And I think that stops gup_fast from trying
>>> to find the underlying page.
>>> -Daniel
>> Hmm perhaps it might, but I don't think so. The fix I tried out was to set
>>
>> PFN_DEV | PFN_MAP for huge PTEs which causes pfn_devmap() to be true, and
>> then
>>
>> follow_devmap_pmd()->get_dev_pagemap() which returns NULL and gup_fast()
>> backs off,
>>
>> in the end that would mean setting in stone that "if there is a huge devmap
>> page table entry for which we haven't registered any devmap struct pages
>> (get_dev_pagemap returns NULL), we should treat that as a "special" huge
>> page table entry".
>>
>>  From what I can tell, all code calling get_dev_pagemap() already does that,
>> it's just a question of getting it accepted and formalizing it.
> Oh I thought that's already how it works, since I didn't spot anything
> else that would block gup_fast from falling over. I guess really would
> need some testcases to make sure direct i/o (that's the easiest to test)
> fails like we expect.

Yeah, IIRC the "| PFN_MAP" is the missing piece for TTM huge ptes.
Otherwise pmd_devmap() will not return true, and since there is no
pmd_special(), things break.

/Thomas



> -Daniel
Christian König March 1, 2021, 10:17 a.m. UTC | #19
Am 01.03.21 um 10:21 schrieb Thomas Hellström (Intel):
>
> On 3/1/21 10:05 AM, Daniel Vetter wrote:
>> On Mon, Mar 01, 2021 at 09:39:53AM +0100, Thomas Hellström (Intel) 
>> wrote:
>>> Hi,
>>>
>>> On 3/1/21 9:28 AM, Daniel Vetter wrote:
>>>> On Sat, Feb 27, 2021 at 9:06 AM Thomas Hellström (Intel)
>>>> <thomas_os@shipmail.org> wrote:
>>>>> On 2/26/21 2:28 PM, Daniel Vetter wrote:
>>>>>> So I think it stops gup. But I haven't verified at all. Would be 
>>>>>> good
>>>>>> if Christian can check this with some direct io to a buffer in 
>>>>>> system
>>>>>> memory.
>>>>> Hmm,
>>>>>
>>>>> Docs (again vm_normal_page() say)
>>>>>
>>>>>     * VM_MIXEDMAP mappings can likewise contain memory with or 
>>>>> without "struct
>>>>>     * page" backing, however the difference is that _all_ pages 
>>>>> with a struct
>>>>>     * page (that is, those where pfn_valid is true) are refcounted 
>>>>> and
>>>>> considered
>>>>>     * normal pages by the VM. The disadvantage is that pages are 
>>>>> refcounted
>>>>>     * (which can be slower and simply not an option for some PFNMAP
>>>>> users). The
>>>>>     * advantage is that we don't have to follow the strict 
>>>>> linearity rule of
>>>>>     * PFNMAP mappings in order to support COWable mappings.
>>>>>
>>>>> but it's true __vm_insert_mixed() ends up in the insert_pfn() 
>>>>> path, so
>>>>> the above isn't really true, which makes me wonder if and in that 
>>>>> case
>>>>> why there could any longer ever be a significant performance 
>>>>> difference
>>>>> between MIXEDMAP and PFNMAP.
>>>> Yeah it's definitely confusing. I guess I'll hack up a patch and see
>>>> what sticks.
>>>>
>>>>> BTW regarding the TTM hugeptes, I don't think we ever landed that 
>>>>> devmap
>>>>> hack, so they are (for the non-gup case) relying on
>>>>> vma_is_special_huge(). For the gup case, I think the bug is still 
>>>>> there.
>>>> Maybe there's another devmap hack, but the ttm_vm_insert functions do
>>>> use PFN_DEV and all that. And I think that stops gup_fast from trying
>>>> to find the underlying page.
>>>> -Daniel
>>> Hmm perhaps it might, but I don't think so. The fix I tried out was 
>>> to set
>>>
>>> PFN_DEV | PFN_MAP for huge PTEs which causes pfn_devmap() to be 
>>> true, and
>>> then
>>>
>>> follow_devmap_pmd()->get_dev_pagemap() which returns NULL and 
>>> gup_fast()
>>> backs off,
>>>
>>> in the end that would mean setting in stone that "if there is a huge 
>>> devmap
>>> page table entry for which we haven't registered any devmap struct 
>>> pages
>>> (get_dev_pagemap returns NULL), we should treat that as a "special" 
>>> huge
>>> page table entry".
>>>
>>>  From what I can tell, all code calling get_dev_pagemap() already 
>>> does that,
>>> it's just a question of getting it accepted and formalizing it.
>> Oh I thought that's already how it works, since I didn't spot anything
>> else that would block gup_fast from falling over. I guess really would
>> need some testcases to make sure direct i/o (that's the easiest to test)
>> fails like we expect.
>
> Yeah, IIRC the "| PFN_MAP" is the missing piece for TTM huge ptes. 
> Otherwise pmd_devmap() will not return true and since there is no 
> pmd_special() things break.

Is that maybe the issue we have seen with amdgpu and huge pages?

Apart from that I'm lost, guys; that devmap and gup stuff is not 
something I have a good knowledge of apart from a one mile high view.

Christian.

>
> /Thomas
>
>
>
>> -Daniel
Daniel Vetter March 1, 2021, 2:09 p.m. UTC | #20
On Mon, Mar 1, 2021 at 11:17 AM Christian König
<christian.koenig@amd.com> wrote:
>
>
>
> Am 01.03.21 um 10:21 schrieb Thomas Hellström (Intel):
> >
> > On 3/1/21 10:05 AM, Daniel Vetter wrote:
> >> On Mon, Mar 01, 2021 at 09:39:53AM +0100, Thomas Hellström (Intel)
> >> wrote:
> >>> Hi,
> >>>
> >>> On 3/1/21 9:28 AM, Daniel Vetter wrote:
> >>>> On Sat, Feb 27, 2021 at 9:06 AM Thomas Hellström (Intel)
> >>>> <thomas_os@shipmail.org> wrote:
> >>>>> On 2/26/21 2:28 PM, Daniel Vetter wrote:
> >>>>>> So I think it stops gup. But I haven't verified at all. Would be
> >>>>>> good
> >>>>>> if Christian can check this with some direct io to a buffer in
> >>>>>> system
> >>>>>> memory.
> >>>>> Hmm,
> >>>>>
> >>>>> Docs (again vm_normal_page() say)
> >>>>>
> >>>>>     * VM_MIXEDMAP mappings can likewise contain memory with or
> >>>>> without "struct
> >>>>>     * page" backing, however the difference is that _all_ pages
> >>>>> with a struct
> >>>>>     * page (that is, those where pfn_valid is true) are refcounted
> >>>>> and
> >>>>> considered
> >>>>>     * normal pages by the VM. The disadvantage is that pages are
> >>>>> refcounted
> >>>>>     * (which can be slower and simply not an option for some PFNMAP
> >>>>> users). The
> >>>>>     * advantage is that we don't have to follow the strict
> >>>>> linearity rule of
> >>>>>     * PFNMAP mappings in order to support COWable mappings.
> >>>>>
> >>>>> but it's true __vm_insert_mixed() ends up in the insert_pfn()
> >>>>> path, so
> >>>>> the above isn't really true, which makes me wonder if and in that
> >>>>> case
> >>>>> why there could any longer ever be a significant performance
> >>>>> difference
> >>>>> between MIXEDMAP and PFNMAP.
> >>>> Yeah it's definitely confusing. I guess I'll hack up a patch and see
> >>>> what sticks.
> >>>>
> >>>>> BTW regarding the TTM hugeptes, I don't think we ever landed that
> >>>>> devmap
> >>>>> hack, so they are (for the non-gup case) relying on
> >>>>> vma_is_special_huge(). For the gup case, I think the bug is still
> >>>>> there.
> >>>> Maybe there's another devmap hack, but the ttm_vm_insert functions do
> >>>> use PFN_DEV and all that. And I think that stops gup_fast from trying
> >>>> to find the underlying page.
> >>>> -Daniel
> >>> Hmm perhaps it might, but I don't think so. The fix I tried out was
> >>> to set
> >>>
> >>> PFN_DEV | PFN_MAP for huge PTEs which causes pfn_devmap() to be
> >>> true, and
> >>> then
> >>>
> >>> follow_devmap_pmd()->get_dev_pagemap() which returns NULL and
> >>> gup_fast()
> >>> backs off,
> >>>
> >>> in the end that would mean setting in stone that "if there is a huge
> >>> devmap
> >>> page table entry for which we haven't registered any devmap struct
> >>> pages
> >>> (get_dev_pagemap returns NULL), we should treat that as a "special"
> >>> huge
> >>> page table entry".
> >>>
> >>>  From what I can tell, all code calling get_dev_pagemap() already
> >>> does that,
> >>> it's just a question of getting it accepted and formalizing it.
> >> Oh I thought that's already how it works, since I didn't spot anything
> >> else that would block gup_fast from falling over. I guess really would
> >> need some testcases to make sure direct i/o (that's the easiest to test)
> >> fails like we expect.
> >
> > Yeah, IIRC the "| PFN_MAP" is the missing piece for TTM huge ptes.
> > Otherwise pmd_devmap() will not return true and since there is no
> > pmd_special() things break.
>
> Is that maybe the issue we have seen with amdgpu and huge pages?

Yeah, essentially when you have a hugepte inserted by ttm, and it
happens to point at system memory, then gup will work on that. And
create all kinds of havoc.

> Apart from that I'm lost guys, that devmap and gup stuff is not
> something I have a good knowledge of apart from a one mile high view.

I'm not really better, hence would be good to do a testcase and see.
This should provoke it:
- allocate nicely aligned bo in system memory
- mmap, again nicely aligned to 2M
- do some direct io from a filesystem into that mmap, that should trigger gup
- before the gup completes, free the mmap and bo so that ttm recycles
the pages, which should trip up on the elevated refcount. If you wait
until the direct io is complete, then I think nothing bad can be
observed. (A rough userspace sketch of this follows below.)
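
Something like this, as a very rough userspace sketch (the device node,
the way the BO mapping is obtained and the test file are placeholders,
and a real test would free the BO while the direct io is still in
flight, e.g. via aio or io_uring, to actually hit the race):

/* Very rough reproducer sketch; paths and the BO mmap offset are
 * placeholders for the driver-specific ioctls. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define BO_SIZE (2UL << 20) /* 2 MiB, nicely aligned for a huge entry */

int main(void)
{
	int drm_fd, data_fd;
	void *bo_map;

	drm_fd = open("/dev/dri/renderD128", O_RDWR);    /* placeholder */
	data_fd = open("testfile", O_RDONLY | O_DIRECT); /* placeholder */
	if (drm_fd < 0 || data_fd < 0)
		return 1;

	/* Placeholder: a real test allocates a BO in system memory via
	 * driver ioctls and mmaps it through the offset they return. */
	bo_map = mmap(NULL, BO_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
		      drm_fd, 0 /* BO mmap offset */);
	if (bo_map == MAP_FAILED)
		return 1;

	/* Direct io into the BO mapping forces the kernel through gup on
	 * the BO pages; with pfn/special entries this is expected to fail
	 * rather than pin the pages. */
	if (read(data_fd, bo_map, BO_SIZE) < 0)
		perror("direct read into BO mapping");

	munmap(bo_map, BO_SIZE);
	close(data_fd);
	close(drm_fd);
	return 0;
}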

Ofc if your amdgpu+hugepte issue is something else, then maybe we have
another issue.

Also usual caveat: I'm not an mm hacker either, so might be completely wrong.
-Daniel
Thomas Hellström (Intel) March 11, 2021, 10:22 a.m. UTC | #21
On 3/1/21 3:09 PM, Daniel Vetter wrote:
> On Mon, Mar 1, 2021 at 11:17 AM Christian König
> <christian.koenig@amd.com> wrote:
>>
>>
>> Am 01.03.21 um 10:21 schrieb Thomas Hellström (Intel):
>>> On 3/1/21 10:05 AM, Daniel Vetter wrote:
>>>> On Mon, Mar 01, 2021 at 09:39:53AM +0100, Thomas Hellström (Intel)
>>>> wrote:
>>>>> Hi,
>>>>>
>>>>> On 3/1/21 9:28 AM, Daniel Vetter wrote:
>>>>>> On Sat, Feb 27, 2021 at 9:06 AM Thomas Hellström (Intel)
>>>>>> <thomas_os@shipmail.org> wrote:
>>>>>>> On 2/26/21 2:28 PM, Daniel Vetter wrote:
>>>>>>>> So I think it stops gup. But I haven't verified at all. Would be
>>>>>>>> good
>>>>>>>> if Christian can check this with some direct io to a buffer in
>>>>>>>> system
>>>>>>>> memory.
>>>>>>> Hmm,
>>>>>>>
>>>>>>> Docs (again vm_normal_page() say)
>>>>>>>
>>>>>>>      * VM_MIXEDMAP mappings can likewise contain memory with or
>>>>>>> without "struct
>>>>>>>      * page" backing, however the difference is that _all_ pages
>>>>>>> with a struct
>>>>>>>      * page (that is, those where pfn_valid is true) are refcounted
>>>>>>> and
>>>>>>> considered
>>>>>>>      * normal pages by the VM. The disadvantage is that pages are
>>>>>>> refcounted
>>>>>>>      * (which can be slower and simply not an option for some PFNMAP
>>>>>>> users). The
>>>>>>>      * advantage is that we don't have to follow the strict
>>>>>>> linearity rule of
>>>>>>>      * PFNMAP mappings in order to support COWable mappings.
>>>>>>>
>>>>>>> but it's true __vm_insert_mixed() ends up in the insert_pfn()
>>>>>>> path, so
>>>>>>> the above isn't really true, which makes me wonder if and in that
>>>>>>> case
>>>>>>> why there could any longer ever be a significant performance
>>>>>>> difference
>>>>>>> between MIXEDMAP and PFNMAP.
>>>>>> Yeah it's definitely confusing. I guess I'll hack up a patch and see
>>>>>> what sticks.
>>>>>>
>>>>>>> BTW regarding the TTM hugeptes, I don't think we ever landed that
>>>>>>> devmap
>>>>>>> hack, so they are (for the non-gup case) relying on
>>>>>>> vma_is_special_huge(). For the gup case, I think the bug is still
>>>>>>> there.
>>>>>> Maybe there's another devmap hack, but the ttm_vm_insert functions do
>>>>>> use PFN_DEV and all that. And I think that stops gup_fast from trying
>>>>>> to find the underlying page.
>>>>>> -Daniel
>>>>> Hmm perhaps it might, but I don't think so. The fix I tried out was
>>>>> to set
>>>>>
>>>>> PFN_DEV | PFN_MAP for huge PTEs which causes pfn_devmap() to be
>>>>> true, and
>>>>> then
>>>>>
>>>>> follow_devmap_pmd()->get_dev_pagemap() which returns NULL and
>>>>> gup_fast()
>>>>> backs off,
>>>>>
>>>>> in the end that would mean setting in stone that "if there is a huge
>>>>> devmap
>>>>> page table entry for which we haven't registered any devmap struct
>>>>> pages
>>>>> (get_dev_pagemap returns NULL), we should treat that as a "special"
>>>>> huge
>>>>> page table entry".
>>>>>
>>>>>   From what I can tell, all code calling get_dev_pagemap() already
>>>>> does that,
>>>>> it's just a question of getting it accepted and formalizing it.
>>>> Oh I thought that's already how it works, since I didn't spot anything
>>>> else that would block gup_fast from falling over. I guess really would
>>>> need some testcases to make sure direct i/o (that's the easiest to test)
>>>> fails like we expect.
>>> Yeah, IIRC the "| PFN_MAP" is the missing piece for TTM huge ptes.
>>> Otherwise pmd_devmap() will not return true and since there is no
>>> pmd_special() things break.
>> Is that maybe the issue we have seen with amdgpu and huge pages?
> Yeah, essentially when you have a hugepte inserted by ttm, and it
> happens to point at system memory, then gup will work on that. And
> create all kinds of havoc.
>
>> Apart from that I'm lost guys, that devmap and gup stuff is not
>> something I have a good knowledge of apart from a one mile high view.
> I'm not really better, hence would be good to do a testcase and see.
> This should provoke it:
> - allocate nicely aligned bo in system memory
> - mmap, again nicely aligned to 2M
> - do some direct io from a filesystem into that mmap, that should trigger gup
> - before the gup completes free the mmap and bo so that ttm recycles
> the pages, which should trip up on the elevated refcount. If you wait
> until the direct io is completely, then I think nothing bad can be
> observed.
>
> Ofc if your amdgpu+hugepte issue is something else, then maybe we have
> another issue.
>
> Also usual caveat: I'm not an mm hacker either, so might be completely wrong.
> -Daniel

So I did the following quick experiment on vmwgfx, and it turns out that
with it, fast gup never succeeds. Without the "| PFN_MAP", it typically
succeeds.

I should probably craft an RFC formalizing this.

/Thomas

diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 6dc96cf66744..72b6fb17c984 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -195,6 +195,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
         pfn_t pfnt;
         struct ttm_tt *ttm = bo->ttm;
         bool write = vmf->flags & FAULT_FLAG_WRITE;
+       struct dev_pagemap *pagemap;

         /* Fault should not cross bo boundary. */
         page_offset &= ~(fault_page_size - 1);
@@ -210,6 +211,17 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
         if ((pfn & (fault_page_size - 1)) != 0)
                 goto out_fallback;

+       /*
+        * Huge entries must be special, that is marking them as devmap
+        * with no backing device map range. If there is a backing
+        * range, Don't insert a huge entry.
+        */
+       pagemap = get_dev_pagemap(pfn, NULL);
+       if (pagemap) {
+               put_dev_pagemap(pagemap);
+               goto out_fallback;
+       }
+
         /* Check that memory is contiguous. */
         if (!bo->mem.bus.is_iomem) {
                 for (i = 1; i < fault_page_size; ++i) {
@@ -223,7 +235,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
                 }
         }

-       pfnt = __pfn_to_pfn_t(pfn, PFN_DEV);
+       pfnt = __pfn_to_pfn_t(pfn, PFN_DEV | PFN_MAP);
         if (fault_page_size == (HPAGE_PMD_SIZE >> PAGE_SHIFT))
                 ret = vmf_insert_pfn_pmd_prot(vmf, pfnt, pgprot, write);
  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
@@ -236,6 +248,21 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
         if (ret != VM_FAULT_NOPAGE)
                 goto out_fallback;

+#if 1
+       {
+               int npages;
+               struct page *page;
+
+               npages = get_user_pages_fast_only(vmf->address, 1, 0, &page);
+               if (npages == 1) {
+                       DRM_WARN("Fast gup succeeded. Bad.\n");
+                       put_page(page);
+               } else {
+                       DRM_INFO("Fast gup failed. Good.\n");
+               }
+       }
+#endif
+
         return VM_FAULT_NOPAGE;
  out_fallback:
         count_vm_event(THP_FAULT_FALLBACK);
Daniel Vetter March 11, 2021, 1 p.m. UTC | #22
On Thu, Mar 11, 2021 at 11:22:06AM +0100, Thomas Hellström (Intel) wrote:
> 
> On 3/1/21 3:09 PM, Daniel Vetter wrote:
> > On Mon, Mar 1, 2021 at 11:17 AM Christian König
> > <christian.koenig@amd.com> wrote:
> > > 
> > > 
> > > Am 01.03.21 um 10:21 schrieb Thomas Hellström (Intel):
> > > > On 3/1/21 10:05 AM, Daniel Vetter wrote:
> > > > > On Mon, Mar 01, 2021 at 09:39:53AM +0100, Thomas Hellström (Intel)
> > > > > wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > On 3/1/21 9:28 AM, Daniel Vetter wrote:
> > > > > > > On Sat, Feb 27, 2021 at 9:06 AM Thomas Hellström (Intel)
> > > > > > > <thomas_os@shipmail.org> wrote:
> > > > > > > > On 2/26/21 2:28 PM, Daniel Vetter wrote:
> > > > > > > > > So I think it stops gup. But I haven't verified at all. Would be
> > > > > > > > > good
> > > > > > > > > if Christian can check this with some direct io to a buffer in
> > > > > > > > > system
> > > > > > > > > memory.
> > > > > > > > Hmm,
> > > > > > > > 
> > > > > > > > Docs (again vm_normal_page() say)
> > > > > > > > 
> > > > > > > >      * VM_MIXEDMAP mappings can likewise contain memory with or
> > > > > > > > without "struct
> > > > > > > >      * page" backing, however the difference is that _all_ pages
> > > > > > > > with a struct
> > > > > > > >      * page (that is, those where pfn_valid is true) are refcounted
> > > > > > > > and
> > > > > > > > considered
> > > > > > > >      * normal pages by the VM. The disadvantage is that pages are
> > > > > > > > refcounted
> > > > > > > >      * (which can be slower and simply not an option for some PFNMAP
> > > > > > > > users). The
> > > > > > > >      * advantage is that we don't have to follow the strict
> > > > > > > > linearity rule of
> > > > > > > >      * PFNMAP mappings in order to support COWable mappings.
> > > > > > > > 
> > > > > > > > but it's true __vm_insert_mixed() ends up in the insert_pfn()
> > > > > > > > path, so
> > > > > > > > the above isn't really true, which makes me wonder if and in that
> > > > > > > > case
> > > > > > > > why there could any longer ever be a significant performance
> > > > > > > > difference
> > > > > > > > between MIXEDMAP and PFNMAP.
> > > > > > > Yeah it's definitely confusing. I guess I'll hack up a patch and see
> > > > > > > what sticks.
> > > > > > > 
> > > > > > > > BTW regarding the TTM hugeptes, I don't think we ever landed that
> > > > > > > > devmap
> > > > > > > > hack, so they are (for the non-gup case) relying on
> > > > > > > > vma_is_special_huge(). For the gup case, I think the bug is still
> > > > > > > > there.
> > > > > > > Maybe there's another devmap hack, but the ttm_vm_insert functions do
> > > > > > > use PFN_DEV and all that. And I think that stops gup_fast from trying
> > > > > > > to find the underlying page.
> > > > > > > -Daniel
> > > > > > Hmm perhaps it might, but I don't think so. The fix I tried out was
> > > > > > to set
> > > > > > 
> > > > > > PFN_DEV | PFN_MAP for huge PTEs which causes pfn_devmap() to be
> > > > > > true, and
> > > > > > then
> > > > > > 
> > > > > > follow_devmap_pmd()->get_dev_pagemap() which returns NULL and
> > > > > > gup_fast()
> > > > > > backs off,
> > > > > > 
> > > > > > in the end that would mean setting in stone that "if there is a huge
> > > > > > devmap
> > > > > > page table entry for which we haven't registered any devmap struct
> > > > > > pages
> > > > > > (get_dev_pagemap returns NULL), we should treat that as a "special"
> > > > > > huge
> > > > > > page table entry".
> > > > > > 
> > > > > >   From what I can tell, all code calling get_dev_pagemap() already
> > > > > > does that,
> > > > > > it's just a question of getting it accepted and formalizing it.
> > > > > Oh I thought that's already how it works, since I didn't spot anything
> > > > > else that would block gup_fast from falling over. I guess really would
> > > > > need some testcases to make sure direct i/o (that's the easiest to test)
> > > > > fails like we expect.
> > > > Yeah, IIRC the "| PFN_MAP" is the missing piece for TTM huge ptes.
> > > > Otherwise pmd_devmap() will not return true and since there is no
> > > > pmd_special() things break.
> > > Is that maybe the issue we have seen with amdgpu and huge pages?
> > Yeah, essentially when you have a hugepte inserted by ttm, and it
> > happens to point at system memory, then gup will work on that. And
> > create all kinds of havoc.
> > 
> > > Apart from that I'm lost guys, that devmap and gup stuff is not
> > > something I have a good knowledge of apart from a one mile high view.
> > I'm not really better, hence would be good to do a testcase and see.
> > This should provoke it:
> > - allocate nicely aligned bo in system memory
> > - mmap, again nicely aligned to 2M
> > - do some direct io from a filesystem into that mmap, that should trigger gup
> > - before the gup completes free the mmap and bo so that ttm recycles
> > the pages, which should trip up on the elevated refcount. If you wait
> > until the direct io is completely, then I think nothing bad can be
> > observed.
> > 
> > Ofc if your amdgpu+hugepte issue is something else, then maybe we have
> > another issue.
> > 
> > Also usual caveat: I'm not an mm hacker either, so might be completely wrong.
> > -Daniel
> 
> So I did the following quick experiment on vmwgfx, and it turns out that
> with it,
> fast gup never succeeds. Without the "| PFN_MAP", it typically succeeds
> 
> I should probably craft an RFC formalizing this.

Yeah I think that would be good. Maybe even more formalized if we also
switch over to VM_PFNMAP, since afaiui these pte flags here only stop the
fast gup path. And slow gup can still peek through VM_MIXEDMAP. Or
something like that.

Otoh your description of when it only sometimes succeeds would indicate my
understanding of VM_PFNMAP vs VM_MIXEDMAP is wrong here.

Christian, what's your take?
-Daniel

> 
> /Thomas
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c
> b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> index 6dc96cf66744..72b6fb17c984 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> @@ -195,6 +195,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault
> *vmf,
>         pfn_t pfnt;
>         struct ttm_tt *ttm = bo->ttm;
>         bool write = vmf->flags & FAULT_FLAG_WRITE;
> +       struct dev_pagemap *pagemap;
> 
>         /* Fault should not cross bo boundary. */
>         page_offset &= ~(fault_page_size - 1);
> @@ -210,6 +211,17 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault
> *vmf,
>         if ((pfn & (fault_page_size - 1)) != 0)
>                 goto out_fallback;
> 
> +       /*
> +        * Huge entries must be special, that is marking them as devmap
> +        * with no backing device map range. If there is a backing
> +        * range, Don't insert a huge entry.
> +        */
> +       pagemap = get_dev_pagemap(pfn, NULL);
> +       if (pagemap) {
> +               put_dev_pagemap(pagemap);
> +               goto out_fallback;
> +       }
> +
>         /* Check that memory is contiguous. */
>         if (!bo->mem.bus.is_iomem) {
>                 for (i = 1; i < fault_page_size; ++i) {
> @@ -223,7 +235,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault
> *vmf,
>                 }
>         }
> 
> -       pfnt = __pfn_to_pfn_t(pfn, PFN_DEV);
> +       pfnt = __pfn_to_pfn_t(pfn, PFN_DEV | PFN_MAP);
>         if (fault_page_size == (HPAGE_PMD_SIZE >> PAGE_SHIFT))
>                 ret = vmf_insert_pfn_pmd_prot(vmf, pfnt, pgprot, write);
>  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> @@ -236,6 +248,21 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault
> *vmf,
>         if (ret != VM_FAULT_NOPAGE)
>                 goto out_fallback;
> 
> +#if 1
> +       {
> +               int npages;
> +               struct page *page;
> +
> +               npages = get_user_pages_fast_only(vmf->address, 1, 0,
> &page);
> +               if (npages == 1) {
> +                       DRM_WARN("Fast gup succeeded. Bad.\n");
> +                       put_page(page);
> +               } else {
> +                       DRM_INFO("Fast gup failed. Good.\n");
> +               }
> +       }
> +#endif
> +
>         return VM_FAULT_NOPAGE;
>  out_fallback:
>         count_vm_event(THP_FAULT_FALLBACK);
> 
> 
> 
> 
>
Thomas Hellström (Intel) March 11, 2021, 1:12 p.m. UTC | #23
Hi!

On 3/11/21 2:00 PM, Daniel Vetter wrote:
> On Thu, Mar 11, 2021 at 11:22:06AM +0100, Thomas Hellström (Intel) wrote:
>> On 3/1/21 3:09 PM, Daniel Vetter wrote:
>>> On Mon, Mar 1, 2021 at 11:17 AM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>>
>>>> Am 01.03.21 um 10:21 schrieb Thomas Hellström (Intel):
>>>>> On 3/1/21 10:05 AM, Daniel Vetter wrote:
>>>>>> On Mon, Mar 01, 2021 at 09:39:53AM +0100, Thomas Hellström (Intel)
>>>>>> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 3/1/21 9:28 AM, Daniel Vetter wrote:
>>>>>>>> On Sat, Feb 27, 2021 at 9:06 AM Thomas Hellström (Intel)
>>>>>>>> <thomas_os@shipmail.org> wrote:
>>>>>>>>> On 2/26/21 2:28 PM, Daniel Vetter wrote:
>>>>>>>>>> So I think it stops gup. But I haven't verified at all. Would be
>>>>>>>>>> good
>>>>>>>>>> if Christian can check this with some direct io to a buffer in
>>>>>>>>>> system
>>>>>>>>>> memory.
>>>>>>>>> Hmm,
>>>>>>>>>
>>>>>>>>> Docs (again vm_normal_page() say)
>>>>>>>>>
>>>>>>>>>       * VM_MIXEDMAP mappings can likewise contain memory with or
>>>>>>>>> without "struct
>>>>>>>>>       * page" backing, however the difference is that _all_ pages
>>>>>>>>> with a struct
>>>>>>>>>       * page (that is, those where pfn_valid is true) are refcounted
>>>>>>>>> and
>>>>>>>>> considered
>>>>>>>>>       * normal pages by the VM. The disadvantage is that pages are
>>>>>>>>> refcounted
>>>>>>>>>       * (which can be slower and simply not an option for some PFNMAP
>>>>>>>>> users). The
>>>>>>>>>       * advantage is that we don't have to follow the strict
>>>>>>>>> linearity rule of
>>>>>>>>>       * PFNMAP mappings in order to support COWable mappings.
>>>>>>>>>
>>>>>>>>> but it's true __vm_insert_mixed() ends up in the insert_pfn()
>>>>>>>>> path, so
>>>>>>>>> the above isn't really true, which makes me wonder if and in that
>>>>>>>>> case
>>>>>>>>> why there could any longer ever be a significant performance
>>>>>>>>> difference
>>>>>>>>> between MIXEDMAP and PFNMAP.
>>>>>>>> Yeah it's definitely confusing. I guess I'll hack up a patch and see
>>>>>>>> what sticks.
>>>>>>>>
>>>>>>>>> BTW regarding the TTM hugeptes, I don't think we ever landed that
>>>>>>>>> devmap
>>>>>>>>> hack, so they are (for the non-gup case) relying on
>>>>>>>>> vma_is_special_huge(). For the gup case, I think the bug is still
>>>>>>>>> there.
>>>>>>>> Maybe there's another devmap hack, but the ttm_vm_insert functions do
>>>>>>>> use PFN_DEV and all that. And I think that stops gup_fast from trying
>>>>>>>> to find the underlying page.
>>>>>>>> -Daniel
>>>>>>> Hmm perhaps it might, but I don't think so. The fix I tried out was
>>>>>>> to set
>>>>>>>
>>>>>>> PFN_DEV | PFN_MAP for huge PTEs which causes pfn_devmap() to be
>>>>>>> true, and
>>>>>>> then
>>>>>>>
>>>>>>> follow_devmap_pmd()->get_dev_pagemap() which returns NULL and
>>>>>>> gup_fast()
>>>>>>> backs off,
>>>>>>>
>>>>>>> in the end that would mean setting in stone that "if there is a huge
>>>>>>> devmap
>>>>>>> page table entry for which we haven't registered any devmap struct
>>>>>>> pages
>>>>>>> (get_dev_pagemap returns NULL), we should treat that as a "special"
>>>>>>> huge
>>>>>>> page table entry".
>>>>>>>
>>>>>>>    From what I can tell, all code calling get_dev_pagemap() already
>>>>>>> does that,
>>>>>>> it's just a question of getting it accepted and formalizing it.
>>>>>> Oh I thought that's already how it works, since I didn't spot anything
>>>>>> else that would block gup_fast from falling over. I guess really would
>>>>>> need some testcases to make sure direct i/o (that's the easiest to test)
>>>>>> fails like we expect.
>>>>> Yeah, IIRC the "| PFN_MAP" is the missing piece for TTM huge ptes.
>>>>> Otherwise pmd_devmap() will not return true and since there is no
>>>>> pmd_special() things break.
>>>> Is that maybe the issue we have seen with amdgpu and huge pages?
>>> Yeah, essentially when you have a hugepte inserted by ttm, and it
>>> happens to point at system memory, then gup will work on that. And
>>> create all kinds of havoc.
>>>
>>>> Apart from that I'm lost guys, that devmap and gup stuff is not
>>>> something I have a good knowledge of apart from a one mile high view.
>>> I'm not really better, hence would be good to do a testcase and see.
>>> This should provoke it:
>>> - allocate nicely aligned bo in system memory
>>> - mmap, again nicely aligned to 2M
>>> - do some direct io from a filesystem into that mmap, that should trigger gup
>>> - before the gup completes free the mmap and bo so that ttm recycles
>>> the pages, which should trip up on the elevated refcount. If you wait
>>> until the direct io is completely, then I think nothing bad can be
>>> observed.
>>>
>>> Ofc if your amdgpu+hugepte issue is something else, then maybe we have
>>> another issue.
>>>
>>> Also usual caveat: I'm not an mm hacker either, so might be completely wrong.
>>> -Daniel
>> So I did the following quick experiment on vmwgfx, and it turns out that
>> with it,
>> fast gup never succeeds. Without the "| PFN_MAP", it typically succeeds
>>
>> I should probably craft an RFC formalizing this.
> Yeah I think that would be good. Maybe even more formalized if we also
> switch over to VM_PFNMAP, since afaiui these pte flags here only stop the
> fast gup path. And slow gup can still peak through VM_MIXEDMAP. Or
> something like that.
>
> Otoh your description of when it only sometimes succeeds would indicate my
> understanding of VM_PFNMAP vs VM_MIXEDMAP is wrong here.

My understanding from reading the vmf_insert_mixed() code is that iff 
the arch has pte_special(), VM_MIXEDMAP should be harmless. But that's 
not consistent with the vm_normal_page() doc. For architectures without 
pte_special, VM_PFNMAP must be used, and then we must also block COW 
mappings.
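
A minimal sketch of the checks I have in mind at dma-buf mmap time
(assuming is_cow_mapping() is usable from there; the function name is
made up and this is not an actual patch, just the shape of the checks
discussed above):

/* Minimal sketch of the checks being discussed; not an actual patch. */
static int dma_buf_mmap_checked_sketch(struct dma_buf *dmabuf,
				       struct vm_area_struct *vma)
{
	int ret;

	/* COW mappings cannot work reliably without pte_special(). */
	if (is_cow_mapping(vma->vm_flags))
		return -EINVAL;

	ret = dmabuf->ops->mmap(dmabuf, vma);
	if (ret)
		return ret;

	/* The exporter must have set up a pfn mapping. */
	if (WARN_ON(!(vma->vm_flags & VM_PFNMAP)))
		return -EINVAL;

	return 0;
}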

If we can get someone to commit to verifying that the potential PAT WC
performance issue is gone with PFNMAP, I can put together a series with
that included.

As for existing userspace using COW TTM mappings, I once had a couple of 
test cases to verify that it actually worked, in particular together 
with huge PMDs and PUDs where breaking COW would imply splitting those, 
but I can't think of anything else actually wanting to do that other 
than by mistake.

/Thomas


>
> Christian, what's your take?
> -Daniel
>
>> /Thomas
>>
>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> index 6dc96cf66744..72b6fb17c984 100644
>> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> @@ -195,6 +195,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault
>> *vmf,
>>          pfn_t pfnt;
>>          struct ttm_tt *ttm = bo->ttm;
>>          bool write = vmf->flags & FAULT_FLAG_WRITE;
>> +       struct dev_pagemap *pagemap;
>>
>>          /* Fault should not cross bo boundary. */
>>          page_offset &= ~(fault_page_size - 1);
>> @@ -210,6 +211,17 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault
>> *vmf,
>>          if ((pfn & (fault_page_size - 1)) != 0)
>>                  goto out_fallback;
>>
>> +       /*
>> +        * Huge entries must be special, that is marking them as devmap
>> +        * with no backing device map range. If there is a backing
>> +        * range, Don't insert a huge entry.
>> +        */
>> +       pagemap = get_dev_pagemap(pfn, NULL);
>> +       if (pagemap) {
>> +               put_dev_pagemap(pagemap);
>> +               goto out_fallback;
>> +       }
>> +
>>          /* Check that memory is contiguous. */
>>          if (!bo->mem.bus.is_iomem) {
>>                  for (i = 1; i < fault_page_size; ++i) {
>> @@ -223,7 +235,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault
>> *vmf,
>>                  }
>>          }
>>
>> -       pfnt = __pfn_to_pfn_t(pfn, PFN_DEV);
>> +       pfnt = __pfn_to_pfn_t(pfn, PFN_DEV | PFN_MAP);
>>          if (fault_page_size == (HPAGE_PMD_SIZE >> PAGE_SHIFT))
>>                  ret = vmf_insert_pfn_pmd_prot(vmf, pfnt, pgprot, write);
>>   #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> @@ -236,6 +248,21 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault
>> *vmf,
>>          if (ret != VM_FAULT_NOPAGE)
>>                  goto out_fallback;
>>
>> +#if 1
>> +       {
>> +               int npages;
>> +               struct page *page;
>> +
>> +               npages = get_user_pages_fast_only(vmf->address, 1, 0,
>> &page);
>> +               if (npages == 1) {
>> +                       DRM_WARN("Fast gup succeeded. Bad.\n");
>> +                       put_page(page);
>> +               } else {
>> +                       DRM_INFO("Fast gup failed. Good.\n");
>> +               }
>> +       }
>> +#endif
>> +
>>          return VM_FAULT_NOPAGE;
>>   out_fallback:
>>          count_vm_event(THP_FAULT_FALLBACK);
>>
>>
>>
>>
>>
Daniel Vetter March 11, 2021, 1:17 p.m. UTC | #24
On Thu, Mar 11, 2021 at 2:12 PM Thomas Hellström (Intel)
<thomas_os@shipmail.org> wrote:
>
> Hi!
>
> On 3/11/21 2:00 PM, Daniel Vetter wrote:
> > On Thu, Mar 11, 2021 at 11:22:06AM +0100, Thomas Hellström (Intel) wrote:
> >> On 3/1/21 3:09 PM, Daniel Vetter wrote:
> >>> On Mon, Mar 1, 2021 at 11:17 AM Christian König
> >>> <christian.koenig@amd.com> wrote:
> >>>>
> >>>> Am 01.03.21 um 10:21 schrieb Thomas Hellström (Intel):
> >>>>> On 3/1/21 10:05 AM, Daniel Vetter wrote:
> >>>>>> On Mon, Mar 01, 2021 at 09:39:53AM +0100, Thomas Hellström (Intel)
> >>>>>> wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> On 3/1/21 9:28 AM, Daniel Vetter wrote:
> >>>>>>>> On Sat, Feb 27, 2021 at 9:06 AM Thomas Hellström (Intel)
> >>>>>>>> <thomas_os@shipmail.org> wrote:
> >>>>>>>>> On 2/26/21 2:28 PM, Daniel Vetter wrote:
> >>>>>>>>>> So I think it stops gup. But I haven't verified at all. Would be
> >>>>>>>>>> good
> >>>>>>>>>> if Christian can check this with some direct io to a buffer in
> >>>>>>>>>> system
> >>>>>>>>>> memory.
> >>>>>>>>> Hmm,
> >>>>>>>>>
> >>>>>>>>> Docs (again vm_normal_page() say)
> >>>>>>>>>
> >>>>>>>>>       * VM_MIXEDMAP mappings can likewise contain memory with or
> >>>>>>>>> without "struct
> >>>>>>>>>       * page" backing, however the difference is that _all_ pages
> >>>>>>>>> with a struct
> >>>>>>>>>       * page (that is, those where pfn_valid is true) are refcounted
> >>>>>>>>> and
> >>>>>>>>> considered
> >>>>>>>>>       * normal pages by the VM. The disadvantage is that pages are
> >>>>>>>>> refcounted
> >>>>>>>>>       * (which can be slower and simply not an option for some PFNMAP
> >>>>>>>>> users). The
> >>>>>>>>>       * advantage is that we don't have to follow the strict
> >>>>>>>>> linearity rule of
> >>>>>>>>>       * PFNMAP mappings in order to support COWable mappings.
> >>>>>>>>>
> >>>>>>>>> but it's true __vm_insert_mixed() ends up in the insert_pfn()
> >>>>>>>>> path, so
> >>>>>>>>> the above isn't really true, which makes me wonder if and in that
> >>>>>>>>> case
> >>>>>>>>> why there could any longer ever be a significant performance
> >>>>>>>>> difference
> >>>>>>>>> between MIXEDMAP and PFNMAP.
> >>>>>>>> Yeah it's definitely confusing. I guess I'll hack up a patch and see
> >>>>>>>> what sticks.
> >>>>>>>>
> >>>>>>>>> BTW regarding the TTM hugeptes, I don't think we ever landed that
> >>>>>>>>> devmap
> >>>>>>>>> hack, so they are (for the non-gup case) relying on
> >>>>>>>>> vma_is_special_huge(). For the gup case, I think the bug is still
> >>>>>>>>> there.
> >>>>>>>> Maybe there's another devmap hack, but the ttm_vm_insert functions do
> >>>>>>>> use PFN_DEV and all that. And I think that stops gup_fast from trying
> >>>>>>>> to find the underlying page.
> >>>>>>>> -Daniel
> >>>>>>> Hmm perhaps it might, but I don't think so. The fix I tried out was
> >>>>>>> to set
> >>>>>>>
> >>>>>>> PFN_DEV | PFN_MAP for huge PTEs which causes pfn_devmap() to be
> >>>>>>> true, and
> >>>>>>> then
> >>>>>>>
> >>>>>>> follow_devmap_pmd()->get_dev_pagemap() which returns NULL and
> >>>>>>> gup_fast()
> >>>>>>> backs off,
> >>>>>>>
> >>>>>>> in the end that would mean setting in stone that "if there is a huge
> >>>>>>> devmap
> >>>>>>> page table entry for which we haven't registered any devmap struct
> >>>>>>> pages
> >>>>>>> (get_dev_pagemap returns NULL), we should treat that as a "special"
> >>>>>>> huge
> >>>>>>> page table entry".
> >>>>>>>
> >>>>>>>    From what I can tell, all code calling get_dev_pagemap() already
> >>>>>>> does that,
> >>>>>>> it's just a question of getting it accepted and formalizing it.
> >>>>>> Oh I thought that's already how it works, since I didn't spot anything
> >>>>>> else that would block gup_fast from falling over. I guess really would
> >>>>>> need some testcases to make sure direct i/o (that's the easiest to test)
> >>>>>> fails like we expect.
> >>>>> Yeah, IIRC the "| PFN_MAP" is the missing piece for TTM huge ptes.
> >>>>> Otherwise pmd_devmap() will not return true and since there is no
> >>>>> pmd_special() things break.
> >>>> Is that maybe the issue we have seen with amdgpu and huge pages?
> >>> Yeah, essentially when you have a hugepte inserted by ttm, and it
> >>> happens to point at system memory, then gup will work on that. And
> >>> create all kinds of havoc.
> >>>
> >>>> Apart from that I'm lost guys, that devmap and gup stuff is not
> >>>> something I have a good knowledge of apart from a one mile high view.
> >>> I'm not really better, hence would be good to do a testcase and see.
> >>> This should provoke it:
> >>> - allocate nicely aligned bo in system memory
> >>> - mmap, again nicely aligned to 2M
> >>> - do some direct io from a filesystem into that mmap, that should trigger gup
> >>> - before the gup completes free the mmap and bo so that ttm recycles
> >>> the pages, which should trip up on the elevated refcount. If you wait
> >>> until the direct io is completely, then I think nothing bad can be
> >>> observed.
> >>>
> >>> Ofc if your amdgpu+hugepte issue is something else, then maybe we have
> >>> another issue.
> >>>
> >>> Also usual caveat: I'm not an mm hacker either, so might be completely wrong.
> >>> -Daniel
> >> So I did the following quick experiment on vmwgfx, and it turns out that
> >> with it,
> >> fast gup never succeeds. Without the "| PFN_MAP", it typically succeeds
> >>
> >> I should probably craft an RFC formalizing this.
> > Yeah I think that would be good. Maybe even more formalized if we also
> > switch over to VM_PFNMAP, since afaiui these pte flags here only stop the
> > fast gup path. And slow gup can still peak through VM_MIXEDMAP. Or
> > something like that.
> >
> > Otoh your description of when it only sometimes succeeds would indicate my
> > understanding of VM_PFNMAP vs VM_MIXEDMAP is wrong here.
>
> My understanding from reading the vmf_insert_mixed() code is that iff
> the arch has pte_special(), VM_MIXEDMAP should be harmless. But that's
> not consistent with the vm_normal_page() doc. For architectures without
> pte_special, VM_PFNMAP must be used, and then we must also block COW
> mappings.
>
> If we can get someone can commit to verify that the potential PAT WC
> performance issue is gone with PFNMAP, I can put together a series with
> that included.

Iirc when I checked there aren't many archs without pte_special, so I
guess that's why we luck out. Hopefully.

> As for existing userspace using COW TTM mappings, I once had a couple of
> test cases to verify that it actually worked, in particular together
> with huge PMDs and PUDs where breaking COW would imply splitting those,
> but I can't think of anything else actually wanting to do that other
> than by mistake.

Yeah disallowing MAP_PRIVATE mappings would be another good thing to
lock down. Really doesn't make much sense.
-Daniel

> /Thomas
>
>
> >
> > Christian, what's your take?
> > -Daniel
> >
> >> /Thomas
> >>
> >> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c
> >> b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> >> index 6dc96cf66744..72b6fb17c984 100644
> >> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
> >> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> >> @@ -195,6 +195,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault
> >> *vmf,
> >>          pfn_t pfnt;
> >>          struct ttm_tt *ttm = bo->ttm;
> >>          bool write = vmf->flags & FAULT_FLAG_WRITE;
> >> +       struct dev_pagemap *pagemap;
> >>
> >>          /* Fault should not cross bo boundary. */
> >>          page_offset &= ~(fault_page_size - 1);
> >> @@ -210,6 +211,17 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault
> >> *vmf,
> >>          if ((pfn & (fault_page_size - 1)) != 0)
> >>                  goto out_fallback;
> >>
> >> +       /*
> >> +        * Huge entries must be special, that is marking them as devmap
> >> +        * with no backing device map range. If there is a backing
> >> +        * range, Don't insert a huge entry.
> >> +        */
> >> +       pagemap = get_dev_pagemap(pfn, NULL);
> >> +       if (pagemap) {
> >> +               put_dev_pagemap(pagemap);
> >> +               goto out_fallback;
> >> +       }
> >> +
> >>          /* Check that memory is contiguous. */
> >>          if (!bo->mem.bus.is_iomem) {
> >>                  for (i = 1; i < fault_page_size; ++i) {
> >> @@ -223,7 +235,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault
> >> *vmf,
> >>                  }
> >>          }
> >>
> >> -       pfnt = __pfn_to_pfn_t(pfn, PFN_DEV);
> >> +       pfnt = __pfn_to_pfn_t(pfn, PFN_DEV | PFN_MAP);
> >>          if (fault_page_size == (HPAGE_PMD_SIZE >> PAGE_SHIFT))
> >>                  ret = vmf_insert_pfn_pmd_prot(vmf, pfnt, pgprot, write);
> >>   #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> @@ -236,6 +248,21 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault
> >> *vmf,
> >>          if (ret != VM_FAULT_NOPAGE)
> >>                  goto out_fallback;
> >>
> >> +#if 1
> >> +       {
> >> +               int npages;
> >> +               struct page *page;
> >> +
> >> +               npages = get_user_pages_fast_only(vmf->address, 1, 0,
> >> &page);
> >> +               if (npages == 1) {
> >> +                       DRM_WARN("Fast gup succeeded. Bad.\n");
> >> +                       put_page(page);
> >> +               } else {
> >> +                       DRM_INFO("Fast gup failed. Good.\n");
> >> +               }
> >> +       }
> >> +#endif
> >> +
> >>          return VM_FAULT_NOPAGE;
> >>   out_fallback:
> >>          count_vm_event(THP_FAULT_FALLBACK);
> >>
> >>
> >>
> >>
> >>
Thomas Hellström (Intel) March 11, 2021, 3:37 p.m. UTC | #25
On 3/11/21 2:17 PM, Daniel Vetter wrote:
> On Thu, Mar 11, 2021 at 2:12 PM Thomas Hellström (Intel)
> <thomas_os@shipmail.org> wrote:
>> Hi!
>>
>> On 3/11/21 2:00 PM, Daniel Vetter wrote:
>>> On Thu, Mar 11, 2021 at 11:22:06AM +0100, Thomas Hellström (Intel) wrote:
>>>> On 3/1/21 3:09 PM, Daniel Vetter wrote:
>>>>> On Mon, Mar 1, 2021 at 11:17 AM Christian König
>>>>> <christian.koenig@amd.com> wrote:
>>>>>> Am 01.03.21 um 10:21 schrieb Thomas Hellström (Intel):
>>>>>>> On 3/1/21 10:05 AM, Daniel Vetter wrote:
>>>>>>>> On Mon, Mar 01, 2021 at 09:39:53AM +0100, Thomas Hellström (Intel)
>>>>>>>> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On 3/1/21 9:28 AM, Daniel Vetter wrote:
>>>>>>>>>> On Sat, Feb 27, 2021 at 9:06 AM Thomas Hellström (Intel)
>>>>>>>>>> <thomas_os@shipmail.org> wrote:
>>>>>>>>>>> On 2/26/21 2:28 PM, Daniel Vetter wrote:
>>>>>>>>>>>> So I think it stops gup. But I haven't verified at all. Would be
>>>>>>>>>>>> good
>>>>>>>>>>>> if Christian can check this with some direct io to a buffer in
>>>>>>>>>>>> system
>>>>>>>>>>>> memory.
>>>>>>>>>>> Hmm,
>>>>>>>>>>>
>>>>>>>>>>> Docs (again vm_normal_page() say)
>>>>>>>>>>>
>>>>>>>>>>>        * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
>>>>>>>>>>>        * page" backing, however the difference is that _all_ pages with a struct
>>>>>>>>>>>        * page (that is, those where pfn_valid is true) are refcounted and considered
>>>>>>>>>>>        * normal pages by the VM. The disadvantage is that pages are refcounted
>>>>>>>>>>>        * (which can be slower and simply not an option for some PFNMAP users). The
>>>>>>>>>>>        * advantage is that we don't have to follow the strict linearity rule of
>>>>>>>>>>>        * PFNMAP mappings in order to support COWable mappings.
>>>>>>>>>>>
>>>>>>>>>>> but it's true that __vm_insert_mixed() ends up in the insert_pfn()
>>>>>>>>>>> path, so the above isn't really accurate, which makes me wonder if,
>>>>>>>>>>> and in that case why, there could still be a significant performance
>>>>>>>>>>> difference between MIXEDMAP and PFNMAP.
>>>>>>>>>> Yeah it's definitely confusing. I guess I'll hack up a patch and see
>>>>>>>>>> what sticks.
>>>>>>>>>>
>>>>>>>>>>> BTW regarding the TTM hugeptes, I don't think we ever landed that
>>>>>>>>>>> devmap
>>>>>>>>>>> hack, so they are (for the non-gup case) relying on
>>>>>>>>>>> vma_is_special_huge(). For the gup case, I think the bug is still
>>>>>>>>>>> there.
>>>>>>>>>> Maybe there's another devmap hack, but the ttm_vm_insert functions do
>>>>>>>>>> use PFN_DEV and all that. And I think that stops gup_fast from trying
>>>>>>>>>> to find the underlying page.
>>>>>>>>>> -Daniel
>>>>>>>>> Hmm perhaps it might, but I don't think so. The fix I tried out was
>>>>>>>>> to set PFN_DEV | PFN_MAP for huge PTEs, which causes pfn_devmap() to
>>>>>>>>> be true, and then follow_devmap_pmd()->get_dev_pagemap() returns NULL
>>>>>>>>> and gup_fast() backs off,
>>>>>>>>>
>>>>>>>>> in the end that would mean setting in stone that "if there is a huge
>>>>>>>>> devmap page table entry for which we haven't registered any devmap
>>>>>>>>> struct pages (get_dev_pagemap returns NULL), we should treat that as
>>>>>>>>> a "special" huge page table entry".
>>>>>>>>>
>>>>>>>>> From what I can tell, all code calling get_dev_pagemap() already
>>>>>>>>> does that; it's just a question of getting it accepted and
>>>>>>>>> formalizing it.
>>>>>>>> Oh I thought that's already how it works, since I didn't spot anything
>>>>>>>> else that would block gup_fast from falling over. I guess we really would
>>>>>>>> need some testcases to make sure direct i/o (that's the easiest to test)
>>>>>>>> fails like we expect.
>>>>>>> Yeah, IIRC the "| PFN_MAP" is the missing piece for TTM huge ptes.
>>>>>>> Otherwise pmd_devmap() will not return true, and since there is no
>>>>>>> pmd_special(), things break.
>>>>>> Is that maybe the issue we have seen with amdgpu and huge pages?
>>>>> Yeah, essentially when you have a hugepte inserted by ttm, and it
>>>>> happens to point at system memory, then gup will work on that. And
>>>>> create all kinds of havoc.
>>>>>
>>>>>> Apart from that I'm lost guys, that devmap and gup stuff is not
>>>>>> something I have a good knowledge of apart from a one mile high view.
>>>>> I'm not really better, hence it would be good to do a testcase and see.
>>>>> This should provoke it:
>>>>> - allocate nicely aligned bo in system memory
>>>>> - mmap, again nicely aligned to 2M
>>>>> - do some direct io from a filesystem into that mmap, which should trigger gup
>>>>> - before the gup completes, free the mmap and bo so that ttm recycles
>>>>> the pages, which should trip up on the elevated refcount. If you wait
>>>>> until the direct io has completed, then I think nothing bad can be
>>>>> observed.
>>>>>
>>>>> Ofc if your amdgpu+hugepte issue is something else, then maybe we have
>>>>> another issue.
>>>>>
>>>>> Also usual caveat: I'm not an mm hacker either, so might be completely wrong.
>>>>> -Daniel
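A minimal userspace sketch of the reproducer steps quoted above, purely
illustrative: bo_alloc_mmapable_fd() stands in for whatever driver-specific
ioctl hands out the buffer fd, and a real test would additionally force 2 MiB
alignment and submit the read asynchronously (io_uring or libaio) so the
teardown can actually race with the gup'd pages.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define BO_SIZE (2UL << 20) /* 2 MiB, so a huge PMD entry is possible */

/* Assumption: driver-specific allocation returning an mmapable buffer fd. */
extern int bo_alloc_mmapable_fd(unsigned long size);

int main(void)
{
	int bo_fd = bo_alloc_mmapable_fd(BO_SIZE);
	int data_fd = open("scratch.bin", O_RDONLY | O_DIRECT);
	char *map;

	/* Map the bo; a real test would make sure this is 2 MiB aligned. */
	map = mmap(NULL, BO_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, bo_fd, 0);
	if (bo_fd < 0 || data_fd < 0 || map == MAP_FAILED)
		return 1;

	/* Direct I/O into the mapping makes the block layer gup the bo pages. */
	if (read(data_fd, map, BO_SIZE) < 0)
		perror("read");

	/*
	 * Tear down the mapping and the bo. With the I/O still in flight
	 * (hence the need for async submission), ttm would recycle pages
	 * that still carry elevated refcounts from gup.
	 */
	munmap(map, BO_SIZE);
	close(bo_fd);
	close(data_fd);
	return 0;
}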
>>>> So I did the following quick experiment on vmwgfx, and it turns out that
>>>> with it, fast gup never succeeds. Without the "| PFN_MAP", it typically
>>>> succeeds.
>>>>
>>>> I should probably craft an RFC formalizing this.
>>> Yeah I think that would be good. Maybe even more formalized if we also
>>> switch over to VM_PFNMAP, since afaiui these pte flags here only stop the
>>> fast gup path. And slow gup can still peek through VM_MIXEDMAP. Or
>>> something like that.
>>>
>>> Otoh your description of when it only sometimes succeeds would indicate my
>>> understanding of VM_PFNMAP vs VM_MIXEDMAP is wrong here.
>> My understanding from reading the vmf_insert_mixed() code is that iff
>> the arch has pte_special(), VM_MIXEDMAP should be harmless. But that's
>> not consistent with the vm_normal_page() doc. For architectures without
>> pte_special, VM_PFNMAP must be used, and then we must also block COW
>> mappings.
>>
>> If we can get someone to commit to verifying that the potential PAT WC
>> performance issue is gone with PFNMAP, I can put together a series with
>> that included.
> Iirc when I checked there aren't many archs without pte_special, so I
> guess that's why we luck out. Hopefully.
>
>> As for existing userspace using COW TTM mappings, I once had a couple of
>> test cases to verify that it actually worked, in particular together
>> with huge PMDs and PUDs where breaking COW would imply splitting those,
>> but I can't think of anything else actually wanting to do that other
>> than by mistake.
> Yeah disallowing MAP_PRIVATE mappings would be another good thing to
> lock down. Really doesn't make much sense.
> -Daniel

Yes, we can't allow them with PFNMAP + a non-linear address space...

/Thomas
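
For reference, a sketch of what ruling out COW mappings could look like in an
exporter's mmap path. The check mirrors the kernel's is_cow_mapping() logic;
the helper name, the callback and where such a check would ultimately live
(dma-buf core, TTM or the individual drivers) are assumptions for
illustration, not something decided in this thread.

#include <linux/dma-buf.h>
#include <linux/errno.h>
#include <linux/mm.h>

/* Same test as is_cow_mapping(): private (COW-able) but still writable. */
static bool example_vma_is_cow(const struct vm_area_struct *vma)
{
	return (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/* Hypothetical exporter mmap callback; the actual PFN insertion is elided. */
static int example_dmabuf_mmap(struct dma_buf *dmabuf,
			       struct vm_area_struct *vma)
{
	if (example_vma_is_cow(vma))
		return -EINVAL; /* MAP_PRIVATE of a dma-buf makes no sense */

	/* ... set up the VM_PFNMAP mapping as usual ... */
	return 0;
}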


>> /Thomas
>>
>>
>>> Christian, what's your take?
>>> -Daniel
>>>
>>>> /Thomas
>>>>
>>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>>>> index 6dc96cf66744..72b6fb17c984 100644
>>>> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
>>>> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>>>> @@ -195,6 +195,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
>>>>           pfn_t pfnt;
>>>>           struct ttm_tt *ttm = bo->ttm;
>>>>           bool write = vmf->flags & FAULT_FLAG_WRITE;
>>>> +       struct dev_pagemap *pagemap;
>>>>
>>>>           /* Fault should not cross bo boundary. */
>>>>           page_offset &= ~(fault_page_size - 1);
>>>> @@ -210,6 +211,17 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
>>>>           if ((pfn & (fault_page_size - 1)) != 0)
>>>>                   goto out_fallback;
>>>>
>>>> +       /*
>>>> +        * Huge entries must be special, that is marking them as devmap
>>>> +        * with no backing device map range. If there is a backing
>>>> +        * range, don't insert a huge entry.
>>>> +        */
>>>> +       pagemap = get_dev_pagemap(pfn, NULL);
>>>> +       if (pagemap) {
>>>> +               put_dev_pagemap(pagemap);
>>>> +               goto out_fallback;
>>>> +       }
>>>> +
>>>>           /* Check that memory is contiguous. */
>>>>           if (!bo->mem.bus.is_iomem) {
>>>>                   for (i = 1; i < fault_page_size; ++i) {
>>>> @@ -223,7 +235,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
>>>>                   }
>>>>           }
>>>>
>>>> -       pfnt = __pfn_to_pfn_t(pfn, PFN_DEV);
>>>> +       pfnt = __pfn_to_pfn_t(pfn, PFN_DEV | PFN_MAP);
>>>>           if (fault_page_size == (HPAGE_PMD_SIZE >> PAGE_SHIFT))
>>>>                   ret = vmf_insert_pfn_pmd_prot(vmf, pfnt, pgprot, write);
>>>>    #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> @@ -236,6 +248,21 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
>>>>           if (ret != VM_FAULT_NOPAGE)
>>>>                   goto out_fallback;
>>>>
>>>> +#if 1
>>>> +       {
>>>> +               int npages;
>>>> +               struct page *page;
>>>> +
>>>> +               npages = get_user_pages_fast_only(vmf->address, 1, 0, &page);
>>>> +               if (npages == 1) {
>>>> +                       DRM_WARN("Fast gup succeeded. Bad.\n");
>>>> +                       put_page(page);
>>>> +               } else {
>>>> +                       DRM_INFO("Fast gup failed. Good.\n");
>>>> +               }
>>>> +       }
>>>> +#endif
>>>> +
>>>>           return VM_FAULT_NOPAGE;
>>>>    out_fallback:
>>>>           count_vm_event(THP_FAULT_FALLBACK);
>>>>
>>>>
>>>>
>>>>
>>>>
>
>
Christian König March 12, 2021, 7:51 a.m. UTC | #26
Am 11.03.21 um 14:17 schrieb Daniel Vetter:
> [SNIP]
>>>> So I did the following quick experiment on vmwgfx, and it turns out that
>>>> with it, fast gup never succeeds. Without the "| PFN_MAP", it typically
>>>> succeeds.
>>>>
>>>> I should probably craft an RFC formalizing this.
>>> Yeah I think that would be good. Maybe even more formalized if we also
>>> switch over to VM_PFNMAP, since afaiui these pte flags here only stop the
>>> fast gup path. And slow gup can still peek through VM_MIXEDMAP. Or
>>> something like that.
>>>
>>> Otoh your description of when it only sometimes succeeds would indicate my
>>> understanding of VM_PFNMAP vs VM_MIXEDMAP is wrong here.
>> My understanding from reading the vmf_insert_mixed() code is that iff
>> the arch has pte_special(), VM_MIXEDMAP should be harmless. But that's
>> not consistent with the vm_normal_page() doc. For architectures without
>> pte_special, VM_PFNMAP must be used, and then we must also block COW
>> mappings.
>>
>> If we can get someone to commit to verifying that the potential PAT WC
>> performance issue is gone with PFNMAP, I can put together a series with
>> that included.
> Iirc when I checked there aren't many archs without pte_special, so I
> guess that's why we luck out. Hopefully.

I still need to read up a bit on what you guys are discussing here, but
it's starting to form a picture. Especially my understanding of what
VM_MIXEDMAP means seems to have been slightly off.

I would say just go ahead and provide patches to always use VM_PFNMAP in 
TTM and we can test it and see if there are still some issues.

>> As for existing userspace using COW TTM mappings, I once had a couple of
>> test cases to verify that it actually worked, in particular together
>> with huge PMDs and PUDs where breaking COW would imply splitting those,
>> but I can't think of anything else actually wanting to do that other
>> than by mistake.
> Yeah disallowing MAP_PRIVATE mappings would be another good thing to
> lock down. Really doesn't make much sense.

Completely agree. That sounds like something we should try to avoid.

Regards,
Christian.

> -Daniel
>
diff mbox series

Patch

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index f264b70c383e..06cb1d2e9fdc 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -127,6 +127,7 @@  static struct file_system_type dma_buf_fs_type = {
 static int dma_buf_mmap_internal(struct file *file, struct vm_area_struct *vma)
 {
 	struct dma_buf *dmabuf;
+	int ret;
 
 	if (!is_dma_buf_file(file))
 		return -EINVAL;
@@ -142,7 +143,11 @@  static int dma_buf_mmap_internal(struct file *file, struct vm_area_struct *vma)
 	    dmabuf->size >> PAGE_SHIFT)
 		return -EINVAL;
 
-	return dmabuf->ops->mmap(dmabuf, vma);
+	ret = dmabuf->ops->mmap(dmabuf, vma);
+
+	WARN_ON(!(vma->vm_flags & VM_PFNMAP));
+
+	return ret;
 }
 
 static loff_t dma_buf_llseek(struct file *file, loff_t offset, int whence)
@@ -1244,6 +1249,8 @@  EXPORT_SYMBOL_GPL(dma_buf_end_cpu_access);
 int dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma,
 		 unsigned long pgoff)
 {
+	int ret;
+
 	if (WARN_ON(!dmabuf || !vma))
 		return -EINVAL;
 
@@ -1264,7 +1271,11 @@  int dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma,
 	vma_set_file(vma, dmabuf->file);
 	vma->vm_pgoff = pgoff;
 
-	return dmabuf->ops->mmap(dmabuf, vma);
+	ret = dmabuf->ops->mmap(dmabuf, vma);
+
+	WARN_ON(!(vma->vm_flags & VM_PFNMAP));
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(dma_buf_mmap);
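
For exporters wondering what keeps the new WARN_ON quiet: remap_pfn_range()
already marks the vma VM_PFNMAP (along with VM_IO, VM_DONTEXPAND and
VM_DONTDUMP), so a physically contiguous exporter needs no manual flag
fiddling. A minimal sketch, with struct my_buffer made up for illustration:

#include <linux/dma-buf.h>
#include <linux/mm.h>
#include <linux/pfn.h>

/* Hypothetical exporter state, not part of the patch. */
struct my_buffer {
	phys_addr_t paddr;	/* physically contiguous backing storage */
	size_t size;
};

static int my_dmabuf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma)
{
	struct my_buffer *buf = dmabuf->priv;
	unsigned long pfn = PHYS_PFN(buf->paddr) + vma->vm_pgoff;

	/*
	 * remap_pfn_range() inserts PFN ptes (special ptes on architectures
	 * with pte_special) and sets VM_PFNMAP on the vma, which is exactly
	 * what dma_buf_mmap()/dma_buf_mmap_internal() now warn about when it
	 * is missing.
	 */
	return remap_pfn_range(vma, vma->vm_start, pfn,
			       vma->vm_end - vma->vm_start,
			       vma->vm_page_prot);
}

Exporters that install pages from a fault handler instead would set VM_PFNMAP
at mmap time and use vmf_insert_pfn(), which is roughly where the TTM
discussion above is heading.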