
dma-buf: Require VM_PFNMAP vma for mmap

Message ID 20221122170801.842766-1-daniel.vetter@ffwll.ch (mailing list archive)
State New, archived
Series dma-buf: Require VM_PFNMAP vma for mmap

Commit Message

Daniel Vetter Nov. 22, 2022, 5:08 p.m. UTC
tldr; DMA buffers aren't normal memory; expecting that you can use
them like that (e.g. that calling get_user_pages works, or that they're
accounted like any other normal memory) cannot be guaranteed.

Since some userspace only runs on integrated devices, where all
buffers are actually resident system memory, there's a huge
temptation to assume that a struct page is always present and usable
like for any other pagecache-backed mmap. This has the potential to
result in a uapi nightmare.

To close this gap, require that DMA buffer mmaps are VM_PFNMAP, which
blocks get_user_pages and all the other struct page based
infrastructure for everyone. In spirit this is the uapi counterpart to
the kernel-internal CONFIG_DMABUF_DEBUG.

Motivated by a recent patch which wanted to switch the system dma-buf
heap to vm_insert_page instead of vm_insert_pfn.

v2:

Jason brought up that we also want to guarantee that all ptes have the
pte_special flag set, to catch fast get_user_pages (on architectures
that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would
still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.

From auditing the various functions that insert pfn pte entries
(vm_insert_pfn_prot, remap_pfn_range and all its callers like
dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so
this should be the correct flag to check for.

v3: Change to WARN_ON_ONCE (Thomas Zimmermann)

References: https://lore.kernel.org/lkml/CAKMK7uHi+mG0z0HUmNt13QCCvutuRVjpcR0NjRL12k-WbWzkRg@mail.gmail.com/
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
--
Ok I entirely forgot about this patch but stumbled over it and checked
what's up with it now. I think it's ready for merging:
- shmem helper patches to fix up vgem landed
- ttm has been fixed for a while
- I don't think we've had any other open issues

Time to lock down this uapi contract for real?
-Daniel
---
 drivers/dma-buf/dma-buf.c | 4 ++++
 1 file changed, 4 insertions(+)
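
For context, the contract being enforced here is that exporters install their
pages with pfn-based helpers rather than vm_insert_page. Below is a minimal
sketch of a conforming exporter mmap callback; the my_heap_* names and the
buf->paddr/buf->size fields are made up for illustration and not taken from
any in-tree heap.

#include <linux/dma-buf.h>
#include <linux/mm.h>

/*
 * Hypothetical exporter mmap callback: remap_pfn_range() marks the vma
 * VM_PFNMAP and installs special ptes, so get_user_pages() cannot grab
 * references behind the exporter's back.
 */
static int my_heap_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma)
{
	struct my_heap_buffer *buf = dmabuf->priv;	/* made-up exporter type */
	unsigned long size = vma->vm_end - vma->vm_start;

	if (size > buf->size)
		return -EINVAL;

	/* sets VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP on the vma */
	return remap_pfn_range(vma, vma->vm_start, buf->paddr >> PAGE_SHIFT,
			       size, vma->vm_page_prot);
}

An exporter that instead loops over vm_insert_page() ends up with a
VM_MIXEDMAP vma and refcounted, GUP-able ptes, which is exactly what the
WARN_ON_ONCE() added by this patch is meant to flag.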

Comments

Jason Gunthorpe Nov. 22, 2022, 6:03 p.m. UTC | #1
On Tue, Nov 22, 2022 at 06:08:00PM +0100, Daniel Vetter wrote:
> tldr; DMA buffers aren't normal memory, expecting that you can use
> them like that (like calling get_user_pages works, or that they're
> accounting like any other normal memory) cannot be guaranteed.
> 
> Since some userspace only runs on integrated devices, where all
> buffers are actually all resident system memory, there's a huge
> temptation to assume that a struct page is always present and useable
> like for any more pagecache backed mmap. This has the potential to
> result in a uapi nightmare.
> 
> To stop this gap require that DMA buffer mmaps are VM_PFNMAP, which
> blocks get_user_pages and all the other struct page based
> infrastructure for everyone. In spirit this is the uapi counterpart to
> the kernel-internal CONFIG_DMABUF_DEBUG.
> 
> Motivated by a recent patch which wanted to swich the system dma-buf
> heap to vm_insert_page instead of vm_insert_pfn.
> 
> v2:
> 
> Jason brought up that we also want to guarantee that all ptes have the
> pte_special flag set, to catch fast get_user_pages (on architectures
> that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would
> still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
> 
> From auditing the various functions to insert pfn pte entires
> (vm_insert_pfn_prot, remap_pfn_range and all it's callers like
> dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so
> this should be the correct flag to check for.

I didn't look at how this actually gets used, but it is a bit of a
pain to insert a lifetime-controlled object like a struct page as a
special PTE/VM_PFNMAP.

How is the lifetime model implemented here? How do you know when
userspace has finally unmapped the page?

Jason
Daniel Vetter Nov. 22, 2022, 6:08 p.m. UTC | #2
On Tue, 22 Nov 2022 at 19:04, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Nov 22, 2022 at 06:08:00PM +0100, Daniel Vetter wrote:
> > tldr; DMA buffers aren't normal memory, expecting that you can use
> > them like that (like calling get_user_pages works, or that they're
> > accounting like any other normal memory) cannot be guaranteed.
> >
> > Since some userspace only runs on integrated devices, where all
> > buffers are actually all resident system memory, there's a huge
> > temptation to assume that a struct page is always present and useable
> > like for any more pagecache backed mmap. This has the potential to
> > result in a uapi nightmare.
> >
> > To stop this gap require that DMA buffer mmaps are VM_PFNMAP, which
> > blocks get_user_pages and all the other struct page based
> > infrastructure for everyone. In spirit this is the uapi counterpart to
> > the kernel-internal CONFIG_DMABUF_DEBUG.
> >
> > Motivated by a recent patch which wanted to swich the system dma-buf
> > heap to vm_insert_page instead of vm_insert_pfn.
> >
> > v2:
> >
> > Jason brought up that we also want to guarantee that all ptes have the
> > pte_special flag set, to catch fast get_user_pages (on architectures
> > that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would
> > still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
> >
> > From auditing the various functions to insert pfn pte entires
> > (vm_insert_pfn_prot, remap_pfn_range and all it's callers like
> > dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so
> > this should be the correct flag to check for.
>
> I didn't look at how this actually gets used, but it is a bit of a
> pain to insert a lifetime controlled object like a struct page as a
> special PTE/VM_PFNMAP
>
> How is the lifetime model implemented here? How do you know when
> userspace has finally unmapped the page?

The vma has a filp which is the refcounted dma_buf. With dma_buf you
never get an individual page; it's always the entire object. And it's
up to the allocator how exactly it wants to use or not use the page's
refcount. So if gup goes in and elevates the refcount, you can break
stuff, which is why I'm doing this.
-Daniel
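
A rough sketch of the lifetime mechanism described above; this is simplified
from drivers/dma-buf/dma-buf.c (offset validation and error checks dropped),
not the verbatim function:

int dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma,
		 unsigned long pgoff)
{
	int ret;

	/*
	 * Readjust the vma: vma_set_file() takes a reference on
	 * dmabuf->file and drops the reference on the previous vm_file,
	 * so the dma_buf stays alive for as long as the mapping does.
	 */
	vma_set_file(vma, dmabuf->file);
	vma->vm_pgoff = pgoff;

	dma_resv_lock(dmabuf->resv, NULL);
	ret = dmabuf->ops->mmap(dmabuf, vma);
	dma_resv_unlock(dmabuf->resv);

	return ret;
}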
Jason Gunthorpe Nov. 22, 2022, 6:50 p.m. UTC | #3
On Tue, Nov 22, 2022 at 07:08:25PM +0100, Daniel Vetter wrote:
> On Tue, 22 Nov 2022 at 19:04, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Tue, Nov 22, 2022 at 06:08:00PM +0100, Daniel Vetter wrote:
> > > tldr; DMA buffers aren't normal memory, expecting that you can use
> > > them like that (like calling get_user_pages works, or that they're
> > > accounting like any other normal memory) cannot be guaranteed.
> > >
> > > Since some userspace only runs on integrated devices, where all
> > > buffers are actually all resident system memory, there's a huge
> > > temptation to assume that a struct page is always present and useable
> > > like for any more pagecache backed mmap. This has the potential to
> > > result in a uapi nightmare.
> > >
> > > To stop this gap require that DMA buffer mmaps are VM_PFNMAP, which
> > > blocks get_user_pages and all the other struct page based
> > > infrastructure for everyone. In spirit this is the uapi counterpart to
> > > the kernel-internal CONFIG_DMABUF_DEBUG.
> > >
> > > Motivated by a recent patch which wanted to swich the system dma-buf
> > > heap to vm_insert_page instead of vm_insert_pfn.
> > >
> > > v2:
> > >
> > > Jason brought up that we also want to guarantee that all ptes have the
> > > pte_special flag set, to catch fast get_user_pages (on architectures
> > > that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would
> > > still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
> > >
> > > From auditing the various functions to insert pfn pte entires
> > > (vm_insert_pfn_prot, remap_pfn_range and all it's callers like
> > > dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so
> > > this should be the correct flag to check for.
> >
> > I didn't look at how this actually gets used, but it is a bit of a
> > pain to insert a lifetime controlled object like a struct page as a
> > special PTE/VM_PFNMAP
> >
> > How is the lifetime model implemented here? How do you know when
> > userspace has finally unmapped the page?
> 
> The vma has a filp which is the refcounted dma_buf. With dma_buf you
> never get an individual page it's always the entire object. And it's
> up to the allocator how exactly it wants to use or not use the page's
> refcount. So if gup goes in and elevates the refcount, you can break
> stuff, which is why I'm doing this.

But how does move work?

Jason
Daniel Vetter Nov. 22, 2022, 7:29 p.m. UTC | #4
On Tue, 22 Nov 2022 at 19:50, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Nov 22, 2022 at 07:08:25PM +0100, Daniel Vetter wrote:
> > On Tue, 22 Nov 2022 at 19:04, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Tue, Nov 22, 2022 at 06:08:00PM +0100, Daniel Vetter wrote:
> > > > tldr; DMA buffers aren't normal memory, expecting that you can use
> > > > them like that (like calling get_user_pages works, or that they're
> > > > accounting like any other normal memory) cannot be guaranteed.
> > > >
> > > > Since some userspace only runs on integrated devices, where all
> > > > buffers are actually all resident system memory, there's a huge
> > > > temptation to assume that a struct page is always present and useable
> > > > like for any more pagecache backed mmap. This has the potential to
> > > > result in a uapi nightmare.
> > > >
> > > > To stop this gap require that DMA buffer mmaps are VM_PFNMAP, which
> > > > blocks get_user_pages and all the other struct page based
> > > > infrastructure for everyone. In spirit this is the uapi counterpart to
> > > > the kernel-internal CONFIG_DMABUF_DEBUG.
> > > >
> > > > Motivated by a recent patch which wanted to swich the system dma-buf
> > > > heap to vm_insert_page instead of vm_insert_pfn.
> > > >
> > > > v2:
> > > >
> > > > Jason brought up that we also want to guarantee that all ptes have the
> > > > pte_special flag set, to catch fast get_user_pages (on architectures
> > > > that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would
> > > > still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
> > > >
> > > > From auditing the various functions to insert pfn pte entires
> > > > (vm_insert_pfn_prot, remap_pfn_range and all it's callers like
> > > > dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so
> > > > this should be the correct flag to check for.
> > >
> > > I didn't look at how this actually gets used, but it is a bit of a
> > > pain to insert a lifetime controlled object like a struct page as a
> > > special PTE/VM_PFNMAP
> > >
> > > How is the lifetime model implemented here? How do you know when
> > > userspace has finally unmapped the page?
> >
> > The vma has a filp which is the refcounted dma_buf. With dma_buf you
> > never get an individual page it's always the entire object. And it's
> > up to the allocator how exactly it wants to use or not use the page's
> > refcount. So if gup goes in and elevates the refcount, you can break
> > stuff, which is why I'm doing this.
>
> But how does move work?

You nuke all the ptes. Drivers that move have slightly more than a
bare struct file, they also have a struct address_space so that
invalidate_mapping_range() works. Refaulting and any coherency issues
when a refault races against a dma-buf migration are up to the
driver/exporter to handle correctly. None of them rely on struct page
the way mm/ does when moving stuff around for compaction/ksm/numa-balancing/whatever.
-Daniel
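
A hedged sketch of the exporter-side "nuke all the ptes" step; the
my_exporter_* names and the buf->mapping/buf->mmap_offset fields are
placeholders. TTM-based exporters do the equivalent via unmap_mapping_range()
on the device's address_space.

/*
 * Hypothetical exporter move path: shoot down all CPU mappings of the
 * buffer before migrating the backing storage, so the next CPU access
 * refaults and picks up the new location.
 */
static void my_exporter_unmap_cpu(struct my_exporter_buffer *buf)
{
	/*
	 * buf->mapping is the address_space shared by all mmaps of this
	 * buffer; buf->mmap_offset and buf->size span the whole object.
	 */
	unmap_mapping_range(buf->mapping, buf->mmap_offset, buf->size, 1);
}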
Jason Gunthorpe Nov. 22, 2022, 7:34 p.m. UTC | #5
On Tue, Nov 22, 2022 at 08:29:05PM +0100, Daniel Vetter wrote:

> You nuke all the ptes. Drivers that move have slightly more than a
> bare struct file, they also have a struct address_space so that
> invalidate_mapping_range() works.

Okay, this is one of the ways that this can be made to work correctly,
as long as you never allow GUP/GUP_fast to succeed on the PTEs. (this
was the DAX mistake)

Jason
Daniel Vetter Nov. 22, 2022, 7:50 p.m. UTC | #6
On Tue, 22 Nov 2022 at 20:34, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Tue, Nov 22, 2022 at 08:29:05PM +0100, Daniel Vetter wrote:
> > You nuke all the ptes. Drivers that move have slightly more than a
> > bare struct file, they also have a struct address_space so that
> > invalidate_mapping_range() works.
>
> Okay, this is one of the ways that this can be made to work correctly,
> as long as you never allow GUP/GUP_fast to succeed on the PTEs. (this
> was the DAX mistake)

Hence this patch, to enforce that no dma-buf exporter gets this wrong.
Which some did, and then blamed bug reporters for the resulting splats
:-) One of the things we've reverted was the ttm huge pte support,
since that doesn't have the pmd_special flag (yet) and so would let
gup_fast through.
-Daniel
Thomas Zimmermann Nov. 23, 2022, 8:07 a.m. UTC | #7
Hi

Am 22.11.22 um 18:08 schrieb Daniel Vetter:
> tldr; DMA buffers aren't normal memory, expecting that you can use
> them like that (like calling get_user_pages works, or that they're
> accounting like any other normal memory) cannot be guaranteed.
> 
> Since some userspace only runs on integrated devices, where all
> buffers are actually all resident system memory, there's a huge
> temptation to assume that a struct page is always present and useable
> like for any more pagecache backed mmap. This has the potential to
> result in a uapi nightmare.
> 
> To stop this gap require that DMA buffer mmaps are VM_PFNMAP, which
> blocks get_user_pages and all the other struct page based
> infrastructure for everyone. In spirit this is the uapi counterpart to
> the kernel-internal CONFIG_DMABUF_DEBUG.
> 
> Motivated by a recent patch which wanted to swich the system dma-buf
> heap to vm_insert_page instead of vm_insert_pfn.
> 
> v2:
> 
> Jason brought up that we also want to guarantee that all ptes have the
> pte_special flag set, to catch fast get_user_pages (on architectures
> that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would
> still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
> 
>  From auditing the various functions to insert pfn pte entires
> (vm_insert_pfn_prot, remap_pfn_range and all it's callers like
> dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so
> this should be the correct flag to check for.
> 
> v3: Change to WARN_ON_ONCE (Thomas Zimmermann)
> 
> References: https://lore.kernel.org/lkml/CAKMK7uHi+mG0z0HUmNt13QCCvutuRVjpcR0NjRL12k-WbWzkRg@mail.gmail.com/
> Acked-by: Christian König <christian.koenig@amd.com>
> Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
> Cc: Thomas Zimmermann <tzimmermann@suse.de>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: John Stultz <john.stultz@linaro.org>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> Cc: Sumit Semwal <sumit.semwal@linaro.org>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: linux-media@vger.kernel.org
> Cc: linaro-mm-sig@lists.linaro.org
> --
> Ok I entirely forgot about this patch but stumbled over it and checked
> what's up with it no. I think it's ready now for merging:
> - shmem helper patches to fix up vgem landed
> - ttm has been fixed since a while
> - I don't think we've had any other open issues
> 
> Time to lock down this uapi contract for real?
> -Daniel
> ---
>   drivers/dma-buf/dma-buf.c | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index b6c36914e7c6..88718665c3c3 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -150,6 +150,8 @@ static int dma_buf_mmap_internal(struct file *file, struct vm_area_struct *vma)
>   	ret = dmabuf->ops->mmap(dmabuf, vma);
>   	dma_resv_unlock(dmabuf->resv);
>   
> +	WARN_ON_ONCE(!(vma->vm_flags & VM_PFNMAP));

Well, I already a-b'ed this, but does it work with the DMA helpers? I'm
asking because of [1].

Best regards
Thomas

[1] 
https://elixir.bootlin.com/linux/v6.1-rc6/source/drivers/gpu/drm/drm_gem_dma_helper.c#L533

> +
>   	return ret;
>   }
>   
> @@ -1495,6 +1497,8 @@ int dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma,
>   	ret = dmabuf->ops->mmap(dmabuf, vma);
>   	dma_resv_unlock(dmabuf->resv);
>   
> +	WARN_ON_ONCE(!(vma->vm_flags & VM_PFNMAP));
> +
>   	return ret;
>   }
>   EXPORT_SYMBOL_NS_GPL(dma_buf_mmap, DMA_BUF);
Christian König Nov. 23, 2022, 9:06 a.m. UTC | #8
Am 22.11.22 um 20:50 schrieb Daniel Vetter:
> On Tue, 22 Nov 2022 at 20:34, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> On Tue, Nov 22, 2022 at 08:29:05PM +0100, Daniel Vetter wrote:
>>> You nuke all the ptes. Drivers that move have slightly more than a
>>> bare struct file, they also have a struct address_space so that
>>> invalidate_mapping_range() works.
>> Okay, this is one of the ways that this can be made to work correctly,
>> as long as you never allow GUP/GUP_fast to succeed on the PTEs. (this
>> was the DAX mistake)
> Hence this patch, to enforce that no dma-buf exporter gets this wrong.
> Which some did, and then blamed bug reporters for the resulting splats
> :-) One of the things we've reverted was the ttm huge pte support,
> since that doesn't have the pmd_special flag (yet) and so would let
> gup_fast through.

The problem is not only gup; a lot of people seem to assume that when 
you are able to grab a reference to a page, the ptes pointing to 
that page can't change any more. And that's obviously incorrect.

I witnessed tons of discussions about that already. Some customers even 
modified our code assuming that and then wondered why the heck they ran 
into data corruption.

It's gotten so bad that I've even proposed intentionally mangling the 
page reference count on TTM allocated pages: 
https://patchwork.kernel.org/project/dri-devel/patch/20220927143529.135689-1-christian.koenig@amd.com/

I think it would be better if, instead of having special flags in the 
ptes and vmas saying that you can't follow them to a page structure, we 
added something to the page itself indicating that you can't grab a 
reference to it. But this might break some use cases as well.

Regards,
Christian.

> -Daniel
Daniel Vetter Nov. 23, 2022, 9:30 a.m. UTC | #9
On Wed, 23 Nov 2022 at 10:06, Christian König <christian.koenig@amd.com> wrote:
> Am 22.11.22 um 20:50 schrieb Daniel Vetter:
> > On Tue, 22 Nov 2022 at 20:34, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >> On Tue, Nov 22, 2022 at 08:29:05PM +0100, Daniel Vetter wrote:
> >>> You nuke all the ptes. Drivers that move have slightly more than a
> >>> bare struct file, they also have a struct address_space so that
> >>> invalidate_mapping_range() works.
> >> Okay, this is one of the ways that this can be made to work correctly,
> >> as long as you never allow GUP/GUP_fast to succeed on the PTEs. (this
> >> was the DAX mistake)
> > Hence this patch, to enforce that no dma-buf exporter gets this wrong.
> > Which some did, and then blamed bug reporters for the resulting splats
> > :-) One of the things we've reverted was the ttm huge pte support,
> > since that doesn't have the pmd_special flag (yet) and so would let
> > gup_fast through.
>
> The problem is not only gup, a lot of people seem to assume that when
> you are able to grab a reference to a page that the ptes pointing to
> that page can't change any more. And that's obviously incorrect.
>
> I witnessed tons of discussions about that already. Some customers even
> modified our code assuming that and then wondered why the heck they ran
> into data corruption.
>
> It's gotten so bad that I've even proposed intentionally mangling the
> page reference count on TTM allocated pages:
> https://patchwork.kernel.org/project/dri-devel/patch/20220927143529.135689-1-christian.koenig@amd.com/

Yeah maybe something like this could be applied after we land this
patch here. Well maybe should have the same check in gem mmap code to
make sure no driver

> I think it would be better that instead of having special flags in the
> ptes and vmas that you can't follow them to a page structure we would
> add something to the page indicating that you can't grab a reference to
> it. But this might break some use cases as well.

Afaik the problem with that is that there are no free page bits left for
these debug checks. Plus the pte+vma flags are the flags to make this
clear already.
-Daniel
Daniel Vetter Nov. 23, 2022, 9:33 a.m. UTC | #10
On Wed, 23 Nov 2022 at 09:07, Thomas Zimmermann <tzimmermann@suse.de> wrote:
>
> Hi
>
> Am 22.11.22 um 18:08 schrieb Daniel Vetter:
> > tldr; DMA buffers aren't normal memory, expecting that you can use
> > them like that (like calling get_user_pages works, or that they're
> > accounting like any other normal memory) cannot be guaranteed.
> >
> > Since some userspace only runs on integrated devices, where all
> > buffers are actually all resident system memory, there's a huge
> > temptation to assume that a struct page is always present and useable
> > like for any more pagecache backed mmap. This has the potential to
> > result in a uapi nightmare.
> >
> > To stop this gap require that DMA buffer mmaps are VM_PFNMAP, which
> > blocks get_user_pages and all the other struct page based
> > infrastructure for everyone. In spirit this is the uapi counterpart to
> > the kernel-internal CONFIG_DMABUF_DEBUG.
> >
> > Motivated by a recent patch which wanted to swich the system dma-buf
> > heap to vm_insert_page instead of vm_insert_pfn.
> >
> > v2:
> >
> > Jason brought up that we also want to guarantee that all ptes have the
> > pte_special flag set, to catch fast get_user_pages (on architectures
> > that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would
> > still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
> >
> >  From auditing the various functions to insert pfn pte entires
> > (vm_insert_pfn_prot, remap_pfn_range and all it's callers like
> > dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so
> > this should be the correct flag to check for.
> >
> > v3: Change to WARN_ON_ONCE (Thomas Zimmermann)
> >
> > References: https://lore.kernel.org/lkml/CAKMK7uHi+mG0z0HUmNt13QCCvutuRVjpcR0NjRL12k-WbWzkRg@mail.gmail.com/
> > Acked-by: Christian König <christian.koenig@amd.com>
> > Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
> > Cc: Thomas Zimmermann <tzimmermann@suse.de>
> > Cc: Jason Gunthorpe <jgg@ziepe.ca>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Cc: John Stultz <john.stultz@linaro.org>
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > Cc: Sumit Semwal <sumit.semwal@linaro.org>
> > Cc: "Christian König" <christian.koenig@amd.com>
> > Cc: linux-media@vger.kernel.org
> > Cc: linaro-mm-sig@lists.linaro.org
> > --
> > Ok I entirely forgot about this patch but stumbled over it and checked
> > what's up with it no. I think it's ready now for merging:
> > - shmem helper patches to fix up vgem landed
> > - ttm has been fixed since a while
> > - I don't think we've had any other open issues
> >
> > Time to lock down this uapi contract for real?
> > -Daniel
> > ---
> >   drivers/dma-buf/dma-buf.c | 4 ++++
> >   1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > index b6c36914e7c6..88718665c3c3 100644
> > --- a/drivers/dma-buf/dma-buf.c
> > +++ b/drivers/dma-buf/dma-buf.c
> > @@ -150,6 +150,8 @@ static int dma_buf_mmap_internal(struct file *file, struct vm_area_struct *vma)
> >       ret = dmabuf->ops->mmap(dmabuf, vma);
> >       dma_resv_unlock(dmabuf->resv);
> >
> > +     WARN_ON_ONCE(!(vma->vm_flags & VM_PFNMAP));
>
> Well , I already a-b'ed this, but does it work with DMa helpers. I'm
> asking because of [1].
>
> Best regards
> Thomas
>
> [1]
> https://elixir.bootlin.com/linux/v6.1-rc6/source/drivers/gpu/drm/drm_gem_dma_helper.c#L533

This one is entertaining, but also doesn't matter, because the
remap_pfn_range() that the various dma_mmap functions boil down to sets
VM_PFNMAP and a pile of other flags. See

https://elixir.bootlin.com/linux/v6.1-rc6/source/mm/memory.c#L2518
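
The flag-setting in question looks roughly like this in v6.1's
remap_pfn_range_notrack(), so clearing VM_PFNMAP in the helper is undone as
soon as the range is remapped:

	/* mm/memory.c, remap_pfn_range_notrack(), v6.1, abridged */
	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;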

I have no idea why Laurent cleared this flag here just so it gets
reset again a bit later when he added that code. Laurent?
-Daniel

> > +
> >       return ret;
> >   }
> >
> > @@ -1495,6 +1497,8 @@ int dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma,
> >       ret = dmabuf->ops->mmap(dmabuf, vma);
> >       dma_resv_unlock(dmabuf->resv);
> >
> > +     WARN_ON_ONCE(!(vma->vm_flags & VM_PFNMAP));
> > +
> >       return ret;
> >   }
> >   EXPORT_SYMBOL_NS_GPL(dma_buf_mmap, DMA_BUF);
>
> --
> Thomas Zimmermann
> Graphics Driver Developer
> SUSE Software Solutions Germany GmbH
> Maxfeldstr. 5, 90409 Nürnberg, Germany
> (HRB 36809, AG Nürnberg)
> Geschäftsführer: Ivo Totev
Christian König Nov. 23, 2022, 9:39 a.m. UTC | #11
Am 23.11.22 um 10:30 schrieb Daniel Vetter:
> On Wed, 23 Nov 2022 at 10:06, Christian König <christian.koenig@amd.com> wrote:
>> Am 22.11.22 um 20:50 schrieb Daniel Vetter:
>>> On Tue, 22 Nov 2022 at 20:34, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>>> On Tue, Nov 22, 2022 at 08:29:05PM +0100, Daniel Vetter wrote:
>>>>> You nuke all the ptes. Drivers that move have slightly more than a
>>>>> bare struct file, they also have a struct address_space so that
>>>>> invalidate_mapping_range() works.
>>>> Okay, this is one of the ways that this can be made to work correctly,
>>>> as long as you never allow GUP/GUP_fast to succeed on the PTEs. (this
>>>> was the DAX mistake)
>>> Hence this patch, to enforce that no dma-buf exporter gets this wrong.
>>> Which some did, and then blamed bug reporters for the resulting splats
>>> :-) One of the things we've reverted was the ttm huge pte support,
>>> since that doesn't have the pmd_special flag (yet) and so would let
>>> gup_fast through.
>> The problem is not only gup, a lot of people seem to assume that when
>> you are able to grab a reference to a page that the ptes pointing to
>> that page can't change any more. And that's obviously incorrect.
>>
>> I witnessed tons of discussions about that already. Some customers even
>> modified our code assuming that and then wondered why the heck they ran
>> into data corruption.
>>
>> It's gotten so bad that I've even proposed intentionally mangling the
>> page reference count on TTM allocated pages:
>> https://patchwork.kernel.org/project/dri-devel/patch/20220927143529.135689-1-christian.koenig@amd.com/
> Yeah maybe something like this could be applied after we land this
> patch here.

I think both should land ASAP. We don't have any other way than to 
clearly point out incorrect approaches if we want to prevent the 
resulting data corruption.

> Well maybe should have the same check in gem mmap code to
> make sure no driver

Reads like the sentence somehow got cut off?

>
>> I think it would be better that instead of having special flags in the
>> ptes and vmas that you can't follow them to a page structure we would
>> add something to the page indicating that you can't grab a reference to
>> it. But this might break some use cases as well.
> Afaik the problem with that is that there's no free page bits left for
> these debug checks. Plus the pte+vma flags are the flags to make this
> clear already.

Maybe a GFP flag to set the page reference count to zero or something 
like this?

Christian.

> -Daniel
Daniel Vetter Nov. 23, 2022, 10:06 a.m. UTC | #12
On Wed, 23 Nov 2022 at 10:39, Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 23.11.22 um 10:30 schrieb Daniel Vetter:
> > On Wed, 23 Nov 2022 at 10:06, Christian König <christian.koenig@amd.com> wrote:
> >> Am 22.11.22 um 20:50 schrieb Daniel Vetter:
> >>> On Tue, 22 Nov 2022 at 20:34, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >>>> On Tue, Nov 22, 2022 at 08:29:05PM +0100, Daniel Vetter wrote:
> >>>>> You nuke all the ptes. Drivers that move have slightly more than a
> >>>>> bare struct file, they also have a struct address_space so that
> >>>>> invalidate_mapping_range() works.
> >>>> Okay, this is one of the ways that this can be made to work correctly,
> >>>> as long as you never allow GUP/GUP_fast to succeed on the PTEs. (this
> >>>> was the DAX mistake)
> >>> Hence this patch, to enforce that no dma-buf exporter gets this wrong.
> >>> Which some did, and then blamed bug reporters for the resulting splats
> >>> :-) One of the things we've reverted was the ttm huge pte support,
> >>> since that doesn't have the pmd_special flag (yet) and so would let
> >>> gup_fast through.
> >> The problem is not only gup, a lot of people seem to assume that when
> >> you are able to grab a reference to a page that the ptes pointing to
> >> that page can't change any more. And that's obviously incorrect.
> >>
> >> I witnessed tons of discussions about that already. Some customers even
> >> modified our code assuming that and then wondered why the heck they ran
> >> into data corruption.
> >>
> >> It's gotten so bad that I've even proposed intentionally mangling the
> >> page reference count on TTM allocated pages:
> >> https://patchwork.kernel.org/project/dri-devel/patch/20220927143529.135689-1-christian.koenig@amd.com/
> > Yeah maybe something like this could be applied after we land this
> > patch here.
>
> I think both should land ASAP. We don't have any other way than to
> clearly point out incorrect approaches if we want to prevent the
> resulting data corruption.
>
> > Well maybe should have the same check in gem mmap code to
> > make sure no driver
>
> Reads like the sentence is somehow cut of?

Yeah, just wanted to say that we need to make sure in as many places
as possible that VM_PFNMAP is set for bo mmaps.

> >> I think it would be better that instead of having special flags in the
> >> ptes and vmas that you can't follow them to a page structure we would
> >> add something to the page indicating that you can't grab a reference to
> >> it. But this might break some use cases as well.
> > Afaik the problem with that is that there's no free page bits left for
> > these debug checks. Plus the pte+vma flags are the flags to make this
> > clear already.
>
> Maybe a GFP flag to set the page reference count to zero or something
> like this?

Hm yeah that might work. I'm not sure what it will all break though?
And we'd need to make sure that underflowing the page refcount dies in
a backtrace.
-Daniel
Jason Gunthorpe Nov. 23, 2022, 12:46 p.m. UTC | #13
On Wed, Nov 23, 2022 at 11:06:55AM +0100, Daniel Vetter wrote:

> > Maybe a GFP flag to set the page reference count to zero or something
> > like this?
> 
> Hm yeah that might work. I'm not sure what it will all break though?
> And we'd need to make sure that underflowing the page refcount dies in
> a backtrace.

Mucking with the refcount like this to protect against crazy out-of-tree
drivers seems horrible.

The WARN_ON(page_count(p) != 1) seems like a reasonable thing to do,
though you must combine this with the special PTE flag.

Jason
Christian König Nov. 23, 2022, 12:49 p.m. UTC | #14
Am 23.11.22 um 13:46 schrieb Jason Gunthorpe:
> On Wed, Nov 23, 2022 at 11:06:55AM +0100, Daniel Vetter wrote:
>
>>> Maybe a GFP flag to set the page reference count to zero or something
>>> like this?
>> Hm yeah that might work. I'm not sure what it will all break though?
>> And we'd need to make sure that underflowing the page refcount dies in
>> a backtrace.
> Mucking with the refcount like this to protect against crazy out of
> tree drives seems horrible..

Well not only out-of-tree drivers. The in-tree KVM got that horribly 
wrong as well; those were the latest guys complaining about it.

>
> The WARN_ON(pag_count(p) != 1) seems like a reasonable thing to do
> though, though you must combine this with the special PTE flag..

That's not sufficient. The pages are released much later than when things 
actually go wrong. In most cases this WARN_ON here won't hit.

Christian.

>
> Jason
Jason Gunthorpe Nov. 23, 2022, 12:53 p.m. UTC | #15
On Wed, Nov 23, 2022 at 01:49:41PM +0100, Christian König wrote:
> Am 23.11.22 um 13:46 schrieb Jason Gunthorpe:
> > On Wed, Nov 23, 2022 at 11:06:55AM +0100, Daniel Vetter wrote:
> > 
> > > > Maybe a GFP flag to set the page reference count to zero or something
> > > > like this?
> > > Hm yeah that might work. I'm not sure what it will all break though?
> > > And we'd need to make sure that underflowing the page refcount dies in
> > > a backtrace.
> > Mucking with the refcount like this to protect against crazy out of
> > tree drives seems horrible..
> 
> Well not only out of tree drivers. The intree KVM got that horrible
> wrong as well, those where the latest guys complaining about it.

kvm was taking refs on special PTEs? That seems really unlikely?

> > The WARN_ON(pag_count(p) != 1) seems like a reasonable thing to do
> > though, though you must combine this with the special PTE flag..
> 
> That's not sufficient. The pages are released much later than things
> actually go wrong. In most cases this WARN_ON here won't hit.

How so? As long as the page is mapped into the PTE there is no issue
with corruption. If dmabuf checks the refcount after it does the unmap
mapping range it should catch any bogus pin that might be confused
about address coherency.

Jason
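
A sketch of the kind of check Jason describes, with hypothetical exporter
fields and assuming the exporter's pages start out with a refcount of one:

	/* tear down all CPU mappings, then look for leftover pins */
	unmap_mapping_range(buf->mapping, 0, buf->size, 1);
	for (i = 0; i < buf->nr_pages; i++)
		WARN_ON(page_count(buf->pages[i]) != 1);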
Christian König Nov. 23, 2022, 1:12 p.m. UTC | #16
Am 23.11.22 um 13:53 schrieb Jason Gunthorpe:
> On Wed, Nov 23, 2022 at 01:49:41PM +0100, Christian König wrote:
>> Am 23.11.22 um 13:46 schrieb Jason Gunthorpe:
>>> On Wed, Nov 23, 2022 at 11:06:55AM +0100, Daniel Vetter wrote:
>>>
>>>>> Maybe a GFP flag to set the page reference count to zero or something
>>>>> like this?
>>>> Hm yeah that might work. I'm not sure what it will all break though?
>>>> And we'd need to make sure that underflowing the page refcount dies in
>>>> a backtrace.
>>> Mucking with the refcount like this to protect against crazy out of
>>> tree drives seems horrible..
>> Well not only out of tree drivers. The intree KVM got that horrible
>> wrong as well, those where the latest guys complaining about it.
> kvm was taking refs on special PTEs? That seems really unlikely?

Well then look at this code here:

commit add6a0cd1c5ba51b201e1361b05a5df817083618
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Tue Jun 7 17:51:18 2016 +0200

     KVM: MMU: try to fix up page faults before giving up

     The vGPU folks would like to trap the first access to a BAR by setting
     vm_ops on the VMAs produced by mmap-ing a VFIO device.  The fault handler
     then can use remap_pfn_range to place some non-reserved pages in the VMA.

     This kind of VM_PFNMAP mapping is not handled by KVM, but follow_pfn
     and fixup_user_fault together help supporting it.  The patch also supports
     VM_MIXEDMAP vmas where the pfns are not reserved and thus subject to
     reference counting.

     Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
     Cc: Andrea Arcangeli <aarcange@redhat.com>
     Cc: Radim Krčmář <rkrcmar@redhat.com>
     Tested-by: Neo Jia <cjia@nvidia.com>
     Reported-by: Kirti Wankhede <kwankhede@nvidia.com>
     Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

And see also the discussion here: 
https://patchwork.freedesktop.org/patch/414123/

as well as here: https://patchwork.freedesktop.org/patch/499190/

I can't count how often I have pointed out that this is absolutely 
illegal and KVM can't touch pages in VMAs with VM_PFNMAP.

>>> The WARN_ON(pag_count(p) != 1) seems like a reasonable thing to do
>>> though, though you must combine this with the special PTE flag..
>> That's not sufficient. The pages are released much later than things
>> actually go wrong. In most cases this WARN_ON here won't hit.
> How so? As long as the page is mapped into the PTE there is no issue
> with corruption. If dmabuf checks the refcount after it does the unmap
> mapping range it should catch any bogus pin that might be confused
> about address coherency.

Yeah, that would work. The problem is this WARN_ON() comes much later.

The device drivers usually keep the page around for a while even after 
it is unmapped. IIRC the cleanup worker only runs every 10ms or so.

Christian.

>
> Jason
Jason Gunthorpe Nov. 23, 2022, 1:28 p.m. UTC | #17
On Wed, Nov 23, 2022 at 02:12:25PM +0100, Christian König wrote:
> Am 23.11.22 um 13:53 schrieb Jason Gunthorpe:
> > On Wed, Nov 23, 2022 at 01:49:41PM +0100, Christian König wrote:
> > > Am 23.11.22 um 13:46 schrieb Jason Gunthorpe:
> > > > On Wed, Nov 23, 2022 at 11:06:55AM +0100, Daniel Vetter wrote:
> > > > 
> > > > > > Maybe a GFP flag to set the page reference count to zero or something
> > > > > > like this?
> > > > > Hm yeah that might work. I'm not sure what it will all break though?
> > > > > And we'd need to make sure that underflowing the page refcount dies in
> > > > > a backtrace.
> > > > Mucking with the refcount like this to protect against crazy out of
> > > > tree drives seems horrible..
> > > Well not only out of tree drivers. The intree KVM got that horrible
> > > wrong as well, those where the latest guys complaining about it.
> > kvm was taking refs on special PTEs? That seems really unlikely?
> 
> Well then look at this code here:
> 
> commit add6a0cd1c5ba51b201e1361b05a5df817083618
> Author: Paolo Bonzini <pbonzini@redhat.com>
> Date:   Tue Jun 7 17:51:18 2016 +0200
> 
>     KVM: MMU: try to fix up page faults before giving up
> 
>     The vGPU folks would like to trap the first access to a BAR by setting
>     vm_ops on the VMAs produced by mmap-ing a VFIO device.  The fault
> handler
>     then can use remap_pfn_range to place some non-reserved pages in the
> VMA.
> 
>     This kind of VM_PFNMAP mapping is not handled by KVM, but follow_pfn
>     and fixup_user_fault together help supporting it.  The patch also
> supports
>     VM_MIXEDMAP vmas where the pfns are not reserved and thus subject to
>     reference counting.
> 
>     Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
>     Cc: Andrea Arcangeli <aarcange@redhat.com>
>     Cc: Radim Krčmář <rkrcmar@redhat.com>
>     Tested-by: Neo Jia <cjia@nvidia.com>
>     Reported-by: Kirti Wankhede <kwankhede@nvidia.com>
>     Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

This patch is known to be broken in so many ways. It also has a major
security hole in that it ignores the PTE flags making the page
RO. Ignoring the special bit is somehow not surprising :(

This probably doesn't work, but is the general idea of what KVM needs
to do:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1376a47fedeedb..4161241fc3228c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2598,6 +2598,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 			return r;
 	}
 
+	/*
+	 * Special PTEs are never convertible into a struct page, even if the
+	 * driver that owns them might have put a PFN with a struct page into
+	 * the PFNMAP. If the arch doesn't support special then we cannot
+	 * safely process these pages.
+	 */
+#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
+	if (pte_special(*ptep))
+		return -EINVAL;
+#else
+	return -EINVAL;
+#endif
+
 	if (write_fault && !pte_write(*ptep)) {
 		pfn = KVM_PFN_ERR_RO_FAULT;
 		goto out;

Jason
Daniel Vetter Nov. 23, 2022, 2:28 p.m. UTC | #18
On Wed, 23 Nov 2022 at 14:28, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Nov 23, 2022 at 02:12:25PM +0100, Christian König wrote:
> > Am 23.11.22 um 13:53 schrieb Jason Gunthorpe:
> > > On Wed, Nov 23, 2022 at 01:49:41PM +0100, Christian König wrote:
> > > > Am 23.11.22 um 13:46 schrieb Jason Gunthorpe:
> > > > > On Wed, Nov 23, 2022 at 11:06:55AM +0100, Daniel Vetter wrote:
> > > > >
> > > > > > > Maybe a GFP flag to set the page reference count to zero or something
> > > > > > > like this?
> > > > > > Hm yeah that might work. I'm not sure what it will all break though?
> > > > > > And we'd need to make sure that underflowing the page refcount dies in
> > > > > > a backtrace.
> > > > > Mucking with the refcount like this to protect against crazy out of
> > > > > tree drives seems horrible..
> > > > Well not only out of tree drivers. The intree KVM got that horrible
> > > > wrong as well, those where the latest guys complaining about it.
> > > kvm was taking refs on special PTEs? That seems really unlikely?
> >
> > Well then look at this code here:
> >
> > commit add6a0cd1c5ba51b201e1361b05a5df817083618
> > Author: Paolo Bonzini <pbonzini@redhat.com>
> > Date:   Tue Jun 7 17:51:18 2016 +0200
> >
> >     KVM: MMU: try to fix up page faults before giving up
> >
> >     The vGPU folks would like to trap the first access to a BAR by setting
> >     vm_ops on the VMAs produced by mmap-ing a VFIO device.  The fault
> > handler
> >     then can use remap_pfn_range to place some non-reserved pages in the
> > VMA.
> >
> >     This kind of VM_PFNMAP mapping is not handled by KVM, but follow_pfn
> >     and fixup_user_fault together help supporting it.  The patch also
> > supports
> >     VM_MIXEDMAP vmas where the pfns are not reserved and thus subject to
> >     reference counting.
> >
> >     Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
> >     Cc: Andrea Arcangeli <aarcange@redhat.com>
> >     Cc: Radim Krčmář <rkrcmar@redhat.com>
> >     Tested-by: Neo Jia <cjia@nvidia.com>
> >     Reported-by: Kirti Wankhede <kwankhede@nvidia.com>
> >     Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>
> This patch is known to be broken in so many ways. It also has a major
> security hole that it ignores the PTE flags making the page
> RO. Ignoring the special bit is somehow not surprising :(
>
> This probably doesn't work, but is the general idea of what KVM needs
> to do:

Oh dear, when I dug around in there I entirely missed that
kvm_try_get_pfn exists, and it's very broken indeed. kvm really needs
to grow a proper mmu notifier.

Another thing I'm wondering right now, the follow_pte();
fixup_user_fault(); follow_pte(); approach does not make any
guarantees of actually being right. If you're sufficiently unlucky you
might race against an immediate pte invalidate between the fixup and
the 2nd follow_pte(). But you can also not loop, because that would
fail to catch permanent faults.
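
The pattern in question, roughly as it appears in kvm's hva_to_pfn_remapped()
(abridged; variable declarations from the surrounding function omitted):

	r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
	if (r) {
		bool unlocked = false;

		/* fault the page in, then walk the page table again */
		r = fixup_user_fault(current->mm, addr,
				     write_fault ? FAULT_FLAG_WRITE : 0,
				     &unlocked);
		if (r)
			return r;

		r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
		if (r)
			return r;
		/* nothing stops the pte from being zapped right here */
	}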

I think the iommu fault drivers have a similar pattern.

What am I missing here? Or is that also just broken? gup works around
this with the slow path that takes the mmap sem and walks the vma
tree; follow_pte/fixup_user_fault users don't. Maybe mmu notifier based
restarting would help with this too, if done properly.
-Daniel

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1376a47fedeedb..4161241fc3228c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2598,6 +2598,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>                         return r;
>         }
>
> +       /*
> +        * Special PTEs are never convertible into a struct page, even if the
> +        * driver that owns them might have put a PFN with a struct page into
> +        * the PFNMAP. If the arch doesn't support special then we cannot
> +        * safely process these pages.
> +        */
> +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> +       if (pte_special(*ptep))
> +               return -EINVAL;
> +#else
> +       return -EINVAL;
> +#endif
> +
>         if (write_fault && !pte_write(*ptep)) {
>                 pfn = KVM_PFN_ERR_RO_FAULT;
>                 goto out;
>
> Jason
Daniel Vetter Nov. 23, 2022, 2:34 p.m. UTC | #19
On Wed, 23 Nov 2022 at 14:28, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Nov 23, 2022 at 02:12:25PM +0100, Christian König wrote:
> > Am 23.11.22 um 13:53 schrieb Jason Gunthorpe:
> > > On Wed, Nov 23, 2022 at 01:49:41PM +0100, Christian König wrote:
> > > > Am 23.11.22 um 13:46 schrieb Jason Gunthorpe:
> > > > > On Wed, Nov 23, 2022 at 11:06:55AM +0100, Daniel Vetter wrote:
> > > > >
> > > > > > > Maybe a GFP flag to set the page reference count to zero or something
> > > > > > > like this?
> > > > > > Hm yeah that might work. I'm not sure what it will all break though?
> > > > > > And we'd need to make sure that underflowing the page refcount dies in
> > > > > > a backtrace.
> > > > > Mucking with the refcount like this to protect against crazy out of
> > > > > tree drives seems horrible..
> > > > Well not only out of tree drivers. The intree KVM got that horrible
> > > > wrong as well, those where the latest guys complaining about it.
> > > kvm was taking refs on special PTEs? That seems really unlikely?
> >
> > Well then look at this code here:
> >
> > commit add6a0cd1c5ba51b201e1361b05a5df817083618
> > Author: Paolo Bonzini <pbonzini@redhat.com>
> > Date:   Tue Jun 7 17:51:18 2016 +0200
> >
> >     KVM: MMU: try to fix up page faults before giving up
> >
> >     The vGPU folks would like to trap the first access to a BAR by setting
> >     vm_ops on the VMAs produced by mmap-ing a VFIO device.  The fault
> > handler
> >     then can use remap_pfn_range to place some non-reserved pages in the
> > VMA.
> >
> >     This kind of VM_PFNMAP mapping is not handled by KVM, but follow_pfn
> >     and fixup_user_fault together help supporting it.  The patch also
> > supports
> >     VM_MIXEDMAP vmas where the pfns are not reserved and thus subject to
> >     reference counting.
> >
> >     Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
> >     Cc: Andrea Arcangeli <aarcange@redhat.com>
> >     Cc: Radim Krčmář <rkrcmar@redhat.com>
> >     Tested-by: Neo Jia <cjia@nvidia.com>
> >     Reported-by: Kirti Wankhede <kwankhede@nvidia.com>
> >     Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>
> This patch is known to be broken in so many ways. It also has a major
> security hole that it ignores the PTE flags making the page
> RO. Ignoring the special bit is somehow not surprising :(
>
> This probably doesn't work, but is the general idea of what KVM needs
> to do:
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1376a47fedeedb..4161241fc3228c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2598,6 +2598,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>                         return r;
>         }
>
> +       /*
> +        * Special PTEs are never convertible into a struct page, even if the
> +        * driver that owns them might have put a PFN with a struct page into
> +        * the PFNMAP. If the arch doesn't support special then we cannot
> +        * safely process these pages.
> +        */
> +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> +       if (pte_special(*ptep))
> +               return -EINVAL;

On second thought this won't work, because it completely defeats the
point of why this code here exists. remap_pfn_range() (which is what
the various dma_mmap functions and the ioremap functions are built on
top of too) sets VM_PFNMAP too, so this check would even catch the
static mappings.

Plus these static mappings aren't all that static either, e.g. pci
access also can revoke bar mappings nowadays.

I think nothing except full mmu_notifier will actually fix this.
-Daniel

> +#else
> +       return -EINVAL;
> +#endif
> +
>         if (write_fault && !pte_write(*ptep)) {
>                 pfn = KVM_PFN_ERR_RO_FAULT;
>                 goto out;
>
> Jason
Jason Gunthorpe Nov. 23, 2022, 3:04 p.m. UTC | #20
On Wed, Nov 23, 2022 at 03:28:27PM +0100, Daniel Vetter wrote:

> > This patch is known to be broken in so many ways. It also has a major
> > security hole that it ignores the PTE flags making the page
> > RO. Ignoring the special bit is somehow not surprising :(
> >
> > This probably doesn't work, but is the general idea of what KVM needs
> > to do:
> 
> Oh dear, when I dug around in there I entirely missed that
> kvm_try_get_pfn exists, and it's very broken indeed. kvm really needs
> to grow a proper mmu notifier.
> 
> Another thing I'm wondering right now, the follow_pte();
> fixup_user_fault(); follow_pte(); approach does not make any
> guarantees of actually being right. If you're sufficiently unlucky you
> might race against an immediate pte invalidate between the fixup and
> the 2nd follow_pte(). But you can also not loop, because that would
> fail to catch permanent faults.

Yes, it is pretty broken.

kvm already has support for mmu notifiers and uses it for other
stuff. I can't remember what exactly this code path was for, IIRC
Paolo talked about having a big rework/fix for it when we last talked
about the missing write protect. I also vauagely recall he had some
explanation why this might be safe.

> I think the iommu fault drivers have a similar pattern.

Where? It shouldn't

The common code for SVA just calls handle_mm_fault() and restarts the
PRI. Since the page table is physically shared there is no issue with
a stale copy.

> What am I missing here? Or is that also just broken. gup works around
> this with the slow path that takes the mmap sem and walking the vma
> tree, follow_pte/fixup_user_fautl users dont.

follow_pte() is just fundamentally broken, things must not use it.

> Maybe mmu notifier based restarting would help with this too, if
> done properly.

That is called hmm_range_fault()

Jason
Jason Gunthorpe Nov. 23, 2022, 3:08 p.m. UTC | #21
On Wed, Nov 23, 2022 at 03:34:54PM +0100, Daniel Vetter wrote:
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 1376a47fedeedb..4161241fc3228c 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2598,6 +2598,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> >                         return r;
> >         }
> >
> > +       /*
> > +        * Special PTEs are never convertible into a struct page, even if the
> > +        * driver that owns them might have put a PFN with a struct page into
> > +        * the PFNMAP. If the arch doesn't support special then we cannot
> > +        * safely process these pages.
> > +        */
> > +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> > +       if (pte_special(*ptep))
> > +               return -EINVAL;
> 
> On second thought this wont work, because it completely defeats the
> point of why this code here exists. remap_pfn_range() (which is what
> the various dma_mmap functions and the ioremap functions are built on
> top of too) sets VM_PFNMAP too, so this check would even catch the
> static mappings.

The problem with the way this code is designed is how it allows
returning the pfn without taking any reference based on things like
!pfn_valid or page_reserved. This allows it to then conditionally put
back the reference based on the same reasoning. It is impossible to
thread pte special into that since it is a PTE flag, not a property of
the PFN.

I don't entirely understand why it needs the page reference at all,
even if it is available - so I can't guess why it is OK to ignore the
page reference in other cases, or why it is OK to be racy..

Eg hmm_range_fault() does not obtain page references and implements a
very similar algorithm to kvm.

> Plus these static mappings aren't all that static either, e.g. pci
> access also can revoke bar mappings nowadays.

And there are already mmu notifiers to handle that, AFAIK.

Jason
Christian König Nov. 23, 2022, 3:15 p.m. UTC | #22
Am 23.11.22 um 16:08 schrieb Jason Gunthorpe:
> On Wed, Nov 23, 2022 at 03:34:54PM +0100, Daniel Vetter wrote:
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index 1376a47fedeedb..4161241fc3228c 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -2598,6 +2598,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>>>                          return r;
>>>          }
>>>
>>> +       /*
>>> +        * Special PTEs are never convertible into a struct page, even if the
>>> +        * driver that owns them might have put a PFN with a struct page into
>>> +        * the PFNMAP. If the arch doesn't support special then we cannot
>>> +        * safely process these pages.
>>> +        */
>>> +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>>> +       if (pte_special(*ptep))
>>> +               return -EINVAL;
>> On second thought this wont work, because it completely defeats the
>> point of why this code here exists. remap_pfn_range() (which is what
>> the various dma_mmap functions and the ioremap functions are built on
>> top of too) sets VM_PFNMAP too, so this check would even catch the
>> static mappings.
> The problem with the way this code is designed is how it allows
> returning the pfn without taking any reference based on things like
> !pfn_valid or page_reserved. This allows it to then conditionally put
> back the reference based on the same reasoning. It is impossible to
> thread pte special into that since it is a PTE flag, not a property of
> the PFN.
>
> I don't entirely understand why it needs the page reference at all,

That's exactly what I've pointed out in the previous discussion about 
that code as well.

As far as I can see, this is just another case where people assumed 
that grabbing a page reference somehow magically prevents the pte from 
changing.

I have not the slightest idea how people got this impression, but I have 
heard it so many times from so many different sources that there must be 
some common cause. Is there maybe some book or tutorial on how to break 
the kernel in a sophisticated way, or something like this?

Anyway, as far as I can see the only correct approach would be to use an 
MMU notifier or the more high-level hmm_range_fault()+seq number approach.

Regards,
Christian.
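
For completeness, a minimal sketch of that pattern as documented for
hmm_range_fault() (linux/hmm.h); the bo->notifier and bo->lock fields are
hypothetical and the pfns[] array management is elided:

	struct hmm_range range = {
		.notifier = &bo->notifier,	/* mmu_interval_notifier */
		.start = start,
		.end = end,
		.hmm_pfns = pfns,		/* one unsigned long per page */
		.default_flags = HMM_PFN_REQ_FAULT,
	};
	int ret;

again:
	range.notifier_seq = mmu_interval_read_begin(&bo->notifier);
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);
	if (ret) {
		if (ret == -EBUSY)
			goto again;
		return ret;
	}

	/* this lock must also be taken in the notifier's invalidate callback */
	mutex_lock(&bo->lock);
	if (mmu_interval_read_retry(&bo->notifier, range.notifier_seq)) {
		mutex_unlock(&bo->lock);
		goto again;
	}
	/* pfns[] stay valid until bo->lock is dropped */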

> even if it is available - so I can't guess why it is OK to ignore the
> page reference in other cases, or why it is OK to be racy..
>
> Eg hmm_range_fault() does not obtain page references and implements a
> very similar algorithm to kvm.
>
>> Plus these static mappings aren't all that static either, e.g. pci
>> access also can revoke bar mappings nowadays.
> And there are already mmu notifiers to handle that, AFAIK.
>
> Jason
Daniel Vetter Nov. 23, 2022, 4:22 p.m. UTC | #23
On Wed, 23 Nov 2022 at 16:04, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Nov 23, 2022 at 03:28:27PM +0100, Daniel Vetter wrote:
>
> > > This patch is known to be broken in so many ways. It also has a major
> > > security hole that it ignores the PTE flags making the page
> > > RO. Ignoring the special bit is somehow not surprising :(
> > >
> > > This probably doesn't work, but is the general idea of what KVM needs
> > > to do:
> >
> > Oh dear, when I dug around in there I entirely missed that
> > kvm_try_get_pfn exists, and it's very broken indeed. kvm really needs
> > to grow a proper mmu notifier.
> >
> > Another thing I'm wondering right now, the follow_pte();
> > fixup_user_fault(); follow_pte(); approach does not make any
> > guarantees of actually being right. If you're sufficiently unlucky you
> > might race against an immediate pte invalidate between the fixup and
> > the 2nd follow_pte(). But you can also not loop, because that would
> > fail to catch permanent faults.
>
> Yes, it is pretty broken.
>
> kvm already has support for mmu notifiers and uses it for other
> stuff. I can't remember what exactly this code path was for, IIRC
> Paolo talked about having a big rework/fix for it when we last talked
> about the missing write protect. I also vauagely recall he had some
> explanation why this might be safe.
>
> > I think the iommu fault drivers have a similar pattern.
>
> Where? It shouldn't
>
> The common code for SVA just calls handle_mm_fault() and restarts the
> PRI. Since the page table is physically shared there is no issue with
> a stale copy.
>
> > What am I missing here? Or is that also just broken. gup works around
> > this with the slow path that takes the mmap sem and walking the vma
> > tree, follow_pte/fixup_user_fautl users dont.
>
> follow_pte() is just fundamentally broken, things must not use it.
>
> > Maybe mmu notifier based restarting would help with this too, if
> > done properly.
>
> That is called hmm_range_fault()

Ah right I mixed that up on a quick grep, thanks for pointing me in
the right direction. Worries appeased.
-Daniel
Daniel Vetter Nov. 23, 2022, 4:26 p.m. UTC | #24
On Wed, 23 Nov 2022 at 16:15, Christian König <christian.koenig@amd.com> wrote:
>
> Am 23.11.22 um 16:08 schrieb Jason Gunthorpe:
> > On Wed, Nov 23, 2022 at 03:34:54PM +0100, Daniel Vetter wrote:
> >>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >>> index 1376a47fedeedb..4161241fc3228c 100644
> >>> --- a/virt/kvm/kvm_main.c
> >>> +++ b/virt/kvm/kvm_main.c
> >>> @@ -2598,6 +2598,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> >>>                          return r;
> >>>          }
> >>>
> >>> +       /*
> >>> +        * Special PTEs are never convertible into a struct page, even if the
> >>> +        * driver that owns them might have put a PFN with a struct page into
> >>> +        * the PFNMAP. If the arch doesn't support special then we cannot
> >>> +        * safely process these pages.
> >>> +        */
> >>> +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> >>> +       if (pte_special(*ptep))
> >>> +               return -EINVAL;
> >> On second thought this wont work, because it completely defeats the
> >> point of why this code here exists. remap_pfn_range() (which is what
> >> the various dma_mmap functions and the ioremap functions are built on
> >> top of too) sets VM_PFNMAP too, so this check would even catch the
> >> static mappings.
> > The problem with the way this code is designed is how it allows
> > returning the pfn without taking any reference based on things like
> > !pfn_valid or page_reserved. This allows it to then conditionally put
> > back the reference based on the same reasoning. It is impossible to
> > thread pte special into that since it is a PTE flag, not a property of
> > the PFN.
> >
> > I don't entirely understand why it needs the page reference at all,
>
> That's exactly what I've pointed out in the previous discussion about
> that code as well.
>
> As far as I can see it this is just another case where people assumed
> that grabbing a page reference somehow magically prevents the pte from
> changing.
>
> I have not the slightest idea how people got this impression, but I have
> heard it so many time from so many different sources that there must be
> some common cause to this. Is the maybe some book or tutorial how to
> sophisticate break the kernel or something like this?

It's what get_user_pages does, so it does "work". Except this path
here is the fallback for when get_user_pages does not work (because of
the pte_special/VM_SPECIAL case). So essentially it's just a rather
broken get_user_pages that handrolls a bunch of things with
bugs&races.

I have no idea why people don't realize they're just reinventing gup
without using gup, but that's essentially what's going on.

> Anyway as far as I can see only correct approach would be to use an MMU
> notifier or more high level hmm_range_fault()+seq number.

Yeah, plus if you go through ptes you really have to obey all the
flags or things will break. Especially the RO pte flag.
-Daniel

>
> Regards,
> Christian.
>
> > even if it is available - so I can't guess why it is OK to ignore the
> > page reference in other cases, or why it is OK to be racy..
> >
> > Eg hmm_range_fault() does not obtain page references and implements a
> > very similar algorithm to kvm.
> >
> >> Plus these static mappings aren't all that static either, e.g. pci
> >> access also can revoke bar mappings nowadays.
> > And there are already mmu notifiers to handle that, AFAIK.
> >
> > Jason
>
Jason Gunthorpe Nov. 23, 2022, 4:26 p.m. UTC | #25
On Wed, Nov 23, 2022 at 04:15:05PM +0100, Christian König wrote:

> I have not the slightest idea how people got this impression, but I have
> heard it so many time from so many different sources that there must be some
> common cause to this. Is the maybe some book or tutorial how to sophisticate
> break the kernel or something like this?
> 
> Anyway as far as I can see only correct approach would be to use an MMU
> notifier or more high level hmm_range_fault()+seq number.

It already uses an mmu notifier, which is why I have no idea what the
page ref is doing.

Jason

Patch

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index b6c36914e7c6..88718665c3c3 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -150,6 +150,8 @@  static int dma_buf_mmap_internal(struct file *file, struct vm_area_struct *vma)
 	ret = dmabuf->ops->mmap(dmabuf, vma);
 	dma_resv_unlock(dmabuf->resv);
 
+	WARN_ON_ONCE(!(vma->vm_flags & VM_PFNMAP));
+
 	return ret;
 }
 
@@ -1495,6 +1497,8 @@  int dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma,
 	ret = dmabuf->ops->mmap(dmabuf, vma);
 	dma_resv_unlock(dmabuf->resv);
 
+	WARN_ON_ONCE(!(vma->vm_flags & VM_PFNMAP));
+
 	return ret;
 }
 EXPORT_SYMBOL_NS_GPL(dma_buf_mmap, DMA_BUF);