diff mbox series

[2/2] mm/frame-vec: use FOLL_LONGTERM

Message ID 20201002175303.390363-2-daniel.vetter@ffwll.ch (mailing list archive)
State New, archived
Headers show
Series [1/2] mm/frame-vec: Drop gup_flags from get_vaddr_frames() | expand

Commit Message

Daniel Vetter Oct. 2, 2020, 5:53 p.m. UTC
For $reasons I've stumbled over this code and I'm not sure the change
to the new gup functions in 55a650c35fea ("mm/gup: frame_vector:
convert get_user_pages() --> pin_user_pages()") was entirely correct.

This here is used for long term buffers (not just quick I/O) like
RDMA, and John notes this in his patch. But I thought the rule for
these is that they need to add FOLL_LONGTERM, which John's patch
didn't do.

There is already a dax specific check (added in b7f0554a56f2 ("mm:
fail get_vaddr_frames() for filesystem-dax mappings")), so this seems
like the prudent thing to do.

Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
---
Hi all,

I stumbled over this and figured typing this patch can't hurt. Really
just to maybe learn a few things about how gup/pup is supposed to be
used (we have a bit of that in drivers/gpu), this here isn't really
ralated to anything I'm doing.

I'm also wondering whether the explicit dax check should be removed,
since FOLL_LONGTERM should take care of that already.
-Daniel
---
 mm/frame_vector.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Jason Gunthorpe Oct. 2, 2020, 6:06 p.m. UTC | #1
On Fri, Oct 02, 2020 at 07:53:03PM +0200, Daniel Vetter wrote:
> For $reasons I've stumbled over this code and I'm not sure the change
> to the new gup functions in 55a650c35fea ("mm/gup: frame_vector:
> convert get_user_pages() --> pin_user_pages()") was entirely correct.
> 
> This here is used for long term buffers (not just quick I/O) like
> RDMA, and John notes this in his patch. But I thought the rule for
> these is that they need to add FOLL_LONGTERM, which John's patch
> didn't do.
> 
> There is already a dax specific check (added in b7f0554a56f2 ("mm:
> fail get_vaddr_frames() for filesystem-dax mappings")), so this seems
> like the prudent thing to do.
> 
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: linux-mm@kvack.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-samsung-soc@vger.kernel.org
> Cc: linux-media@vger.kernel.org
> Hi all,
> 
> I stumbled over this and figured typing this patch can't hurt. Really
> just to maybe learn a few things about how gup/pup is supposed to be
> used (we have a bit of that in drivers/gpu), this here isn't really
> ralated to anything I'm doing.

FOLL_FORCE is a pretty big clue it should be FOLL_LONGTERM, IMHO

> I'm also wondering whether the explicit dax check should be removed,
> since FOLL_LONGTERM should take care of that already.

Yep! Confirms the above!

This get_vaddr_frames() thing looks impossible to use properly. How on
earth does a driver guarentee

 "If @start belongs to VM_IO | VM_PFNMAP vma, we don't touch page
 structures and the caller must make sure pfns aren't reused for
 anything else while he is using them."

The only possible way to do that is if the driver restricts the VMAs
to ones it owns and interacts with the vm_private data to refcount
something.

Since every driver does this wrong anything that uses this is creating
terrifying security issues.

IMHO this whole API should be deleted :(

Jason
Daniel Vetter Oct. 2, 2020, 6:16 p.m. UTC | #2
On Fri, Oct 2, 2020 at 8:06 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Fri, Oct 02, 2020 at 07:53:03PM +0200, Daniel Vetter wrote:
> > For $reasons I've stumbled over this code and I'm not sure the change
> > to the new gup functions in 55a650c35fea ("mm/gup: frame_vector:
> > convert get_user_pages() --> pin_user_pages()") was entirely correct.
> >
> > This here is used for long term buffers (not just quick I/O) like
> > RDMA, and John notes this in his patch. But I thought the rule for
> > these is that they need to add FOLL_LONGTERM, which John's patch
> > didn't do.
> >
> > There is already a dax specific check (added in b7f0554a56f2 ("mm:
> > fail get_vaddr_frames() for filesystem-dax mappings")), so this seems
> > like the prudent thing to do.
> >
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Cc: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: linux-mm@kvack.org
> > Cc: linux-arm-kernel@lists.infradead.org
> > Cc: linux-samsung-soc@vger.kernel.org
> > Cc: linux-media@vger.kernel.org
> > Hi all,
> >
> > I stumbled over this and figured typing this patch can't hurt. Really
> > just to maybe learn a few things about how gup/pup is supposed to be
> > used (we have a bit of that in drivers/gpu), this here isn't really
> > ralated to anything I'm doing.
>
> FOLL_FORCE is a pretty big clue it should be FOLL_LONGTERM, IMHO

Since you're here ... I've noticed that ib sets FOLL_FORCE when the ib
verb access mode indicates possible writes. I'm not really clear on
why FOLL_WRITE isn't enough any why you need to be able to write
through a vma that's write protected currently.

> > I'm also wondering whether the explicit dax check should be removed,
> > since FOLL_LONGTERM should take care of that already.
>
> Yep! Confirms the above!
>
> This get_vaddr_frames() thing looks impossible to use properly. How on
> earth does a driver guarentee
>
>  "If @start belongs to VM_IO | VM_PFNMAP vma, we don't touch page
>  structures and the caller must make sure pfns aren't reused for
>  anything else while he is using them."
>
> The only possible way to do that is if the driver restricts the VMAs
> to ones it owns and interacts with the vm_private data to refcount
> something.
>
> Since every driver does this wrong anything that uses this is creating
> terrifying security issues.
>
> IMHO this whole API should be deleted :(

Yeah that part I just tried to conveniently ignore. I guess this dates
back to a time when ioremaps where at best fixed, and there wasn't
anything like a gpu driver dynamically managing vram around, resulting
in random entirely unrelated things possibly being mapped to that set
of pfns.

The underlying follow_pfn is also used in other places within
drivers/media, so this doesn't seem to be an accident, but actually
intentional.

I guess minimally we'd need a VM_PFNMAP flag for dynamically manged
drivers like modern drm gpu drivers, to make sure follow_pfn doesn't
follow these?
-Daniel
John Hubbard Oct. 2, 2020, 10:39 p.m. UTC | #3
On 10/2/20 10:53 AM, Daniel Vetter wrote:
> For $reasons I've stumbled over this code and I'm not sure the change
> to the new gup functions in 55a650c35fea ("mm/gup: frame_vector:
> convert get_user_pages() --> pin_user_pages()") was entirely correct.
> 
> This here is used for long term buffers (not just quick I/O) like
> RDMA, and John notes this in his patch. But I thought the rule for
> these is that they need to add FOLL_LONGTERM, which John's patch
> didn't do.

Yep. The earlier gup --> pup conversion patches were intended to not
have any noticeable behavior changes, and FOLL_LONGTERM, with it's
special cases and such, added some risk that I wasn't ready to take
on yet. Also, FOLL_LONGTERM rules are only *recently* getting firmed
up. So there was some doubt at least in my mind, about which sites
should have it.

But now that we're here, I think it's really good that you've brought
this up. It's definitely time to add FOLL_LONGTERM wherever it's missing.

thanks,
Jason Gunthorpe Oct. 2, 2020, 11:31 p.m. UTC | #4
On Fri, Oct 02, 2020 at 08:16:48PM +0200, Daniel Vetter wrote:
> On Fri, Oct 2, 2020 at 8:06 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Fri, Oct 02, 2020 at 07:53:03PM +0200, Daniel Vetter wrote:
> > > For $reasons I've stumbled over this code and I'm not sure the change
> > > to the new gup functions in 55a650c35fea ("mm/gup: frame_vector:
> > > convert get_user_pages() --> pin_user_pages()") was entirely correct.
> > >
> > > This here is used for long term buffers (not just quick I/O) like
> > > RDMA, and John notes this in his patch. But I thought the rule for
> > > these is that they need to add FOLL_LONGTERM, which John's patch
> > > didn't do.
> > >
> > > There is already a dax specific check (added in b7f0554a56f2 ("mm:
> > > fail get_vaddr_frames() for filesystem-dax mappings")), so this seems
> > > like the prudent thing to do.
> > >
> > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: John Hubbard <jhubbard@nvidia.com>
> > > Cc: Jérôme Glisse <jglisse@redhat.com>
> > > Cc: Jan Kara <jack@suse.cz>
> > > Cc: Dan Williams <dan.j.williams@intel.com>
> > > Cc: linux-mm@kvack.org
> > > Cc: linux-arm-kernel@lists.infradead.org
> > > Cc: linux-samsung-soc@vger.kernel.org
> > > Cc: linux-media@vger.kernel.org
> > > Hi all,
> > >
> > > I stumbled over this and figured typing this patch can't hurt. Really
> > > just to maybe learn a few things about how gup/pup is supposed to be
> > > used (we have a bit of that in drivers/gpu), this here isn't really
> > > ralated to anything I'm doing.
> >
> > FOLL_FORCE is a pretty big clue it should be FOLL_LONGTERM, IMHO
> 
> Since you're here ... I've noticed that ib sets FOLL_FORCE when the ib
> verb access mode indicates possible writes. I'm not really clear on
> why FOLL_WRITE isn't enough any why you need to be able to write
> through a vma that's write protected currently.

Ah, FOLL_FORCE | FOLL_WRITE means *read* confusingly enough, and the
only reason you'd want this version for read is if you are doing
longterm stuff. I wrote about this recently:

https://lore.kernel.org/linux-mm/20200928235739.GU9916@ziepe.ca/

> > Since every driver does this wrong anything that uses this is creating
> > terrifying security issues.
> >
> > IMHO this whole API should be deleted :(
> 
> Yeah that part I just tried to conveniently ignore. I guess this dates
> back to a time when ioremaps where at best fixed, and there wasn't
> anything like a gpu driver dynamically managing vram around, resulting
> in random entirely unrelated things possibly being mapped to that set
> of pfns.

No, it was always wrong. Prior to GPU like cases the lifetime of the
PTE was tied to the vma and when the vma becomes free the driver can
move the things in the PTEs to 'free'. Easy to trigger use-after-free
issues and devices like RDMA have security contexts attached to these
PTEs so it becomes a serious security bug to do something like this.

> The underlying follow_pfn is also used in other places within
> drivers/media, so this doesn't seem to be an accident, but actually
> intentional.

Looking closely, there are very few users, most *seem* pointless, but
maybe there is a crazy reason?

The sequence 
  get_vaddr_frames(); 
  frame_vector_to_pages();
  sg_alloc_table_from_pages(); 

Should be written
  pin_user_pages_fast(FOLL_LONGTERM); 
  sg_alloc_table_from_pages()

There is some 'special' code in frame_vector_to_pages() that tries to
get a struct page for things from a VM_IO or VM_PFNMAP...

Oh snap, that is *completely* broken! If the first VMA is IO|PFNMAP
then get_vaddr_frames() iterates over all VMAs in the range, of any
kind and extracts the PTEs then blindly references them! This means it
can be used to use after free normal RAM struct pages!! Gah!

Wow. Okay. That has to go.

So, I *think* we can assume there is no sane cases where
frame_vector_to_pages() succeeds but pin_user_pages() wasn't called.

That means the users here:
 - habanalabs:  Hey Oded can you fix this up?

 - gpu/exynos: Daniel can you get someone there to stop using it?

 - media/videobuf via vb2_dma_sg_get_userptr()

Should all be switched to the standard pin_user_pages sequence
above.

That leaves the only interesting places as vb2_dc_get_userptr() and
vb2_vmalloc_get_userptr() which both completely fail to follow the
REQUIRED behavior in the function's comment about checking PTEs. It
just DMA maps them. Badly broken.

Guessing this hackery is for some embedded P2P DMA transfer?

After he three places above should use pin_user_pages_fast(), then
this whole broken API should be moved into videobuf2-memops.c and a
big fat "THIS DOESN'T WORK" stuck on it.

videobuf2 should probably use P2P DMA buf for this instead.

Jason
Oded Gabbay Oct. 3, 2020, 8:34 a.m. UTC | #5
On Sat, Oct 3, 2020 at 2:31 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Oct 02, 2020 at 08:16:48PM +0200, Daniel Vetter wrote:
> > On Fri, Oct 2, 2020 at 8:06 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > On Fri, Oct 02, 2020 at 07:53:03PM +0200, Daniel Vetter wrote:
> > > > For $reasons I've stumbled over this code and I'm not sure the change
> > > > to the new gup functions in 55a650c35fea ("mm/gup: frame_vector:
> > > > convert get_user_pages() --> pin_user_pages()") was entirely correct.
> > > >
> > > > This here is used for long term buffers (not just quick I/O) like
> > > > RDMA, and John notes this in his patch. But I thought the rule for
> > > > these is that they need to add FOLL_LONGTERM, which John's patch
> > > > didn't do.
> > > >
> > > > There is already a dax specific check (added in b7f0554a56f2 ("mm:
> > > > fail get_vaddr_frames() for filesystem-dax mappings")), so this seems
> > > > like the prudent thing to do.
> > > >
> > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > Cc: John Hubbard <jhubbard@nvidia.com>
> > > > Cc: Jérôme Glisse <jglisse@redhat.com>
> > > > Cc: Jan Kara <jack@suse.cz>
> > > > Cc: Dan Williams <dan.j.williams@intel.com>
> > > > Cc: linux-mm@kvack.org
> > > > Cc: linux-arm-kernel@lists.infradead.org
> > > > Cc: linux-samsung-soc@vger.kernel.org
> > > > Cc: linux-media@vger.kernel.org
> > > > Hi all,
> > > >
> > > > I stumbled over this and figured typing this patch can't hurt. Really
> > > > just to maybe learn a few things about how gup/pup is supposed to be
> > > > used (we have a bit of that in drivers/gpu), this here isn't really
> > > > ralated to anything I'm doing.
> > >
> > > FOLL_FORCE is a pretty big clue it should be FOLL_LONGTERM, IMHO
> >
> > Since you're here ... I've noticed that ib sets FOLL_FORCE when the ib
> > verb access mode indicates possible writes. I'm not really clear on
> > why FOLL_WRITE isn't enough any why you need to be able to write
> > through a vma that's write protected currently.
>
> Ah, FOLL_FORCE | FOLL_WRITE means *read* confusingly enough, and the
> only reason you'd want this version for read is if you are doing
> longterm stuff. I wrote about this recently:
>
> https://lore.kernel.org/linux-mm/20200928235739.GU9916@ziepe.ca/
>
> > > Since every driver does this wrong anything that uses this is creating
> > > terrifying security issues.
> > >
> > > IMHO this whole API should be deleted :(
> >
> > Yeah that part I just tried to conveniently ignore. I guess this dates
> > back to a time when ioremaps where at best fixed, and there wasn't
> > anything like a gpu driver dynamically managing vram around, resulting
> > in random entirely unrelated things possibly being mapped to that set
> > of pfns.
>
> No, it was always wrong. Prior to GPU like cases the lifetime of the
> PTE was tied to the vma and when the vma becomes free the driver can
> move the things in the PTEs to 'free'. Easy to trigger use-after-free
> issues and devices like RDMA have security contexts attached to these
> PTEs so it becomes a serious security bug to do something like this.
>
> > The underlying follow_pfn is also used in other places within
> > drivers/media, so this doesn't seem to be an accident, but actually
> > intentional.
>
> Looking closely, there are very few users, most *seem* pointless, but
> maybe there is a crazy reason?
>
> The sequence
>   get_vaddr_frames();
>   frame_vector_to_pages();
>   sg_alloc_table_from_pages();
>
> Should be written
>   pin_user_pages_fast(FOLL_LONGTERM);
>   sg_alloc_table_from_pages()
>
> There is some 'special' code in frame_vector_to_pages() that tries to
> get a struct page for things from a VM_IO or VM_PFNMAP...
>
> Oh snap, that is *completely* broken! If the first VMA is IO|PFNMAP
> then get_vaddr_frames() iterates over all VMAs in the range, of any
> kind and extracts the PTEs then blindly references them! This means it
> can be used to use after free normal RAM struct pages!! Gah!
>
> Wow. Okay. That has to go.
>
> So, I *think* we can assume there is no sane cases where
> frame_vector_to_pages() succeeds but pin_user_pages() wasn't called.
>
> That means the users here:
>  - habanalabs:  Hey Oded can you fix this up?
Yes, no problem, I'll do it very soon.
Thanks,
Oded

>
>  - gpu/exynos: Daniel can you get someone there to stop using it?
>
>  - media/videobuf via vb2_dma_sg_get_userptr()
>
> Should all be switched to the standard pin_user_pages sequence
> above.
>
> That leaves the only interesting places as vb2_dc_get_userptr() and
> vb2_vmalloc_get_userptr() which both completely fail to follow the
> REQUIRED behavior in the function's comment about checking PTEs. It
> just DMA maps them. Badly broken.
>
> Guessing this hackery is for some embedded P2P DMA transfer?
>
> After he three places above should use pin_user_pages_fast(), then
> this whole broken API should be moved into videobuf2-memops.c and a
> big fat "THIS DOESN'T WORK" stuck on it.
>
> videobuf2 should probably use P2P DMA buf for this instead.
>
> Jason
Daniel Vetter Oct. 3, 2020, 9:40 a.m. UTC | #6
On Sat, Oct 3, 2020 at 1:31 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Oct 02, 2020 at 08:16:48PM +0200, Daniel Vetter wrote:
> > On Fri, Oct 2, 2020 at 8:06 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > On Fri, Oct 02, 2020 at 07:53:03PM +0200, Daniel Vetter wrote:
> > > > For $reasons I've stumbled over this code and I'm not sure the change
> > > > to the new gup functions in 55a650c35fea ("mm/gup: frame_vector:
> > > > convert get_user_pages() --> pin_user_pages()") was entirely correct.
> > > >
> > > > This here is used for long term buffers (not just quick I/O) like
> > > > RDMA, and John notes this in his patch. But I thought the rule for
> > > > these is that they need to add FOLL_LONGTERM, which John's patch
> > > > didn't do.
> > > >
> > > > There is already a dax specific check (added in b7f0554a56f2 ("mm:
> > > > fail get_vaddr_frames() for filesystem-dax mappings")), so this seems
> > > > like the prudent thing to do.
> > > >
> > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > Cc: John Hubbard <jhubbard@nvidia.com>
> > > > Cc: Jérôme Glisse <jglisse@redhat.com>
> > > > Cc: Jan Kara <jack@suse.cz>
> > > > Cc: Dan Williams <dan.j.williams@intel.com>
> > > > Cc: linux-mm@kvack.org
> > > > Cc: linux-arm-kernel@lists.infradead.org
> > > > Cc: linux-samsung-soc@vger.kernel.org
> > > > Cc: linux-media@vger.kernel.org
> > > > Hi all,
> > > >
> > > > I stumbled over this and figured typing this patch can't hurt. Really
> > > > just to maybe learn a few things about how gup/pup is supposed to be
> > > > used (we have a bit of that in drivers/gpu), this here isn't really
> > > > ralated to anything I'm doing.
> > >
> > > FOLL_FORCE is a pretty big clue it should be FOLL_LONGTERM, IMHO
> >
> > Since you're here ... I've noticed that ib sets FOLL_FORCE when the ib
> > verb access mode indicates possible writes. I'm not really clear on
> > why FOLL_WRITE isn't enough any why you need to be able to write
> > through a vma that's write protected currently.
>
> Ah, FOLL_FORCE | FOLL_WRITE means *read* confusingly enough, and the
> only reason you'd want this version for read is if you are doing
> longterm stuff. I wrote about this recently:
>
> https://lore.kernel.org/linux-mm/20200928235739.GU9916@ziepe.ca/

Ah, so essentially it serves as a FOLL_GET_COW_ISSUES_OUT_OF_MY_WAY. I
think documentation for this, and/or just automatically adding the
flag set combination would be really good. But I see John is already
on top of that it seems.

Thanks for the explainer.

> > > Since every driver does this wrong anything that uses this is creating
> > > terrifying security issues.
> > >
> > > IMHO this whole API should be deleted :(
> >
> > Yeah that part I just tried to conveniently ignore. I guess this dates
> > back to a time when ioremaps where at best fixed, and there wasn't
> > anything like a gpu driver dynamically managing vram around, resulting
> > in random entirely unrelated things possibly being mapped to that set
> > of pfns.
>
> No, it was always wrong. Prior to GPU like cases the lifetime of the
> PTE was tied to the vma and when the vma becomes free the driver can
> move the things in the PTEs to 'free'. Easy to trigger use-after-free
> issues and devices like RDMA have security contexts attached to these
> PTEs so it becomes a serious security bug to do something like this.
>
> > The underlying follow_pfn is also used in other places within
> > drivers/media, so this doesn't seem to be an accident, but actually
> > intentional.
>
> Looking closely, there are very few users, most *seem* pointless, but
> maybe there is a crazy reason?
>
> The sequence
>   get_vaddr_frames();
>   frame_vector_to_pages();
>   sg_alloc_table_from_pages();
>
> Should be written
>   pin_user_pages_fast(FOLL_LONGTERM);
>   sg_alloc_table_from_pages()

Ok, that takes care of exynos and habanalabs. I'll try and wack
together a patch for exynos, that driver is a bit special - first arm
soc driver and we merged it fully well aware that it's full of warts,
just as a show to make it clear that drivers/gpu is also interested in
small gpu drivers ...

> There is some 'special' code in frame_vector_to_pages() that tries to
> get a struct page for things from a VM_IO or VM_PFNMAP...
>
> Oh snap, that is *completely* broken! If the first VMA is IO|PFNMAP
> then get_vaddr_frames() iterates over all VMAs in the range, of any
> kind and extracts the PTEs then blindly references them! This means it
> can be used to use after free normal RAM struct pages!! Gah!
>
> Wow. Okay. That has to go.
>
> So, I *think* we can assume there is no sane cases where
> frame_vector_to_pages() succeeds but pin_user_pages() wasn't called.
>
> That means the users here:
>  - habanalabs:  Hey Oded can you fix this up?
>
>  - gpu/exynos: Daniel can you get someone there to stop using it?
>
>  - media/videobuf via vb2_dma_sg_get_userptr()
>
> Should all be switched to the standard pin_user_pages sequence
> above.
>
> That leaves the only interesting places as vb2_dc_get_userptr() and
> vb2_vmalloc_get_userptr() which both completely fail to follow the
> REQUIRED behavior in the function's comment about checking PTEs. It
> just DMA maps them. Badly broken.
>
> Guessing this hackery is for some embedded P2P DMA transfer?

Yeah, see also the follow_pfn trickery in
videobuf_dma_contig_user_get(), I think this is fully intentional and
userspace abi we can't break :-/

It's mildly better than just sharing phys_addr_t through userspace and
blindly trusting it, which is what all the out-of-tree crap did, but
only mildly.

> After he three places above should use pin_user_pages_fast(), then
> this whole broken API should be moved into videobuf2-memops.c and a
> big fat "THIS DOESN'T WORK" stuck on it.
>
> videobuf2 should probably use P2P DMA buf for this instead.

Yup this should be done with dma_buf instead, and v4l has that. But
old uapi and all that. This is why I said we might need a new
VM_DYNAMIC_PFNMAP or so, to make follow_pfn not resolve this in the
case where the driver manages the underlying iomem range (or whatever
it is) dynamically and moves buffer objects around, like drm drivers
do. But I looked, and we've run out of vma->vm_flags :-(

The other problem is that I also have no real working clue about all
the VM_* flags and what they all mean, and whether drm drivers set the
right ones in all cases (they probably don't, but oh well).
Documentation for this stuff in headers is a bit thin at times.
-Daniel
Daniel Vetter Oct. 3, 2020, 9:45 a.m. UTC | #7
On Sat, Oct 3, 2020 at 12:39 AM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 10/2/20 10:53 AM, Daniel Vetter wrote:
> > For $reasons I've stumbled over this code and I'm not sure the change
> > to the new gup functions in 55a650c35fea ("mm/gup: frame_vector:
> > convert get_user_pages() --> pin_user_pages()") was entirely correct.
> >
> > This here is used for long term buffers (not just quick I/O) like
> > RDMA, and John notes this in his patch. But I thought the rule for
> > these is that they need to add FOLL_LONGTERM, which John's patch
> > didn't do.
>
> Yep. The earlier gup --> pup conversion patches were intended to not
> have any noticeable behavior changes, and FOLL_LONGTERM, with it's
> special cases and such, added some risk that I wasn't ready to take
> on yet. Also, FOLL_LONGTERM rules are only *recently* getting firmed
> up. So there was some doubt at least in my mind, about which sites
> should have it.
>
> But now that we're here, I think it's really good that you've brought
> this up. It's definitely time to add FOLL_LONGTERM wherever it's missing.

So should I keep this patch, or will it collide with a series you're working on?

Also with the firmed up rules, correct that I can also drop the
vma_is_fsdax check when the FOLL_LONGTERM flag is set?

Thanks, Daniel

>
> thanks,
> --
> John Hubbard
> NVIDIA
>
> >
> > There is already a dax specific check (added in b7f0554a56f2 ("mm:
> > fail get_vaddr_frames() for filesystem-dax mappings")), so this seems
> > like the prudent thing to do.
> >
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Cc: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: linux-mm@kvack.org
> > Cc: linux-arm-kernel@lists.infradead.org
> > Cc: linux-samsung-soc@vger.kernel.org
> > Cc: linux-media@vger.kernel.org
> > ---
> > Hi all,
> >
> > I stumbled over this and figured typing this patch can't hurt. Really
> > just to maybe learn a few things about how gup/pup is supposed to be
> > used (we have a bit of that in drivers/gpu), this here isn't really
> > ralated to anything I'm doing.
> >
> > I'm also wondering whether the explicit dax check should be removed,
> > since FOLL_LONGTERM should take care of that already.
> > -Daniel
> > ---
> >   mm/frame_vector.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/frame_vector.c b/mm/frame_vector.c
> > index 5d34c9047e9c..3507e09cb3ff 100644
> > --- a/mm/frame_vector.c
> > +++ b/mm/frame_vector.c
> > @@ -35,7 +35,7 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> >   {
> >       struct mm_struct *mm = current->mm;
> >       struct vm_area_struct *vma;
> > -     unsigned int gup_flags = FOLL_WRITE | FOLL_FORCE;
> > +     unsigned int gup_flags = FOLL_WRITE | FOLL_FORCE | FOLL_LONGTERM;
> >       int ret = 0;
> >       int err;
> >       int locked;
> >
>
John Hubbard Oct. 3, 2020, 10:52 p.m. UTC | #8
On 10/3/20 2:45 AM, Daniel Vetter wrote:
> On Sat, Oct 3, 2020 at 12:39 AM John Hubbard <jhubbard@nvidia.com> wrote:
>>
>> On 10/2/20 10:53 AM, Daniel Vetter wrote:
>>> For $reasons I've stumbled over this code and I'm not sure the change
>>> to the new gup functions in 55a650c35fea ("mm/gup: frame_vector:
>>> convert get_user_pages() --> pin_user_pages()") was entirely correct.
>>>
>>> This here is used for long term buffers (not just quick I/O) like
>>> RDMA, and John notes this in his patch. But I thought the rule for
>>> these is that they need to add FOLL_LONGTERM, which John's patch
>>> didn't do.
>>
>> Yep. The earlier gup --> pup conversion patches were intended to not
>> have any noticeable behavior changes, and FOLL_LONGTERM, with it's
>> special cases and such, added some risk that I wasn't ready to take
>> on yet. Also, FOLL_LONGTERM rules are only *recently* getting firmed
>> up. So there was some doubt at least in my mind, about which sites
>> should have it.
>>
>> But now that we're here, I think it's really good that you've brought
>> this up. It's definitely time to add FOLL_LONGTERM wherever it's missing.
> 
> So should I keep this patch, or will it collide with a series you're working on?

It doesn't collide with anything on my end yet, because I've been slow to
pick up on the need for changing callsites to add FOLL_LONGTERM. :)

And it looks like that's actually a problem, because:

> 
> Also with the firmed up rules, correct that I can also drop the
> vma_is_fsdax check when the FOLL_LONGTERM flag is set?

That's the right direction to go *in general*, but I see that the
pin_user_pages code is still a bit stuck in the past. And this patch
won't actually work, with or without that vma_is_fsdax() check.
Because:

get_vaddr_frames(FOLL_LONGTERM)
    pin_user_pages_locked()
	if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
		return -EINVAL;


So, again, pin_user_pages*() is at least partly behind the times here.
I can jump in and start fixing it up, but it depends on what you and
Oded and others are planning? Note: there is a particular combination of
dax and locking that we have to still avoid, within gup.c. That's
already covered, but needs to continue to be covered when we enable
FOLL_LONGTERM in the remaining pin_user_pages*() calling paths.



thanks,
Jason Gunthorpe Oct. 3, 2020, 11:24 p.m. UTC | #9
On Sat, Oct 03, 2020 at 03:52:32PM -0700, John Hubbard wrote:
> On 10/3/20 2:45 AM, Daniel Vetter wrote:
> > On Sat, Oct 3, 2020 at 12:39 AM John Hubbard <jhubbard@nvidia.com> wrote:
> > > 
> > > On 10/2/20 10:53 AM, Daniel Vetter wrote:
> > > > For $reasons I've stumbled over this code and I'm not sure the change
> > > > to the new gup functions in 55a650c35fea ("mm/gup: frame_vector:
> > > > convert get_user_pages() --> pin_user_pages()") was entirely correct.
> > > > 
> > > > This here is used for long term buffers (not just quick I/O) like
> > > > RDMA, and John notes this in his patch. But I thought the rule for
> > > > these is that they need to add FOLL_LONGTERM, which John's patch
> > > > didn't do.
> > > 
> > > Yep. The earlier gup --> pup conversion patches were intended to not
> > > have any noticeable behavior changes, and FOLL_LONGTERM, with it's
> > > special cases and such, added some risk that I wasn't ready to take
> > > on yet. Also, FOLL_LONGTERM rules are only *recently* getting firmed
> > > up. So there was some doubt at least in my mind, about which sites
> > > should have it.
> > > 
> > > But now that we're here, I think it's really good that you've brought
> > > this up. It's definitely time to add FOLL_LONGTERM wherever it's missing.
> > 
> > So should I keep this patch, or will it collide with a series you're working on?
> 
> It doesn't collide with anything on my end yet, because I've been slow to
> pick up on the need for changing callsites to add FOLL_LONGTERM. :)
> 
> And it looks like that's actually a problem, because:
> 
> > 
> > Also with the firmed up rules, correct that I can also drop the
> > vma_is_fsdax check when the FOLL_LONGTERM flag is set?
> 
> That's the right direction to go *in general*, but I see that the
> pin_user_pages code is still a bit stuck in the past. And this patch
> won't actually work, with or without that vma_is_fsdax() check.
> Because:
> 
> get_vaddr_frames(FOLL_LONGTERM)
>    pin_user_pages_locked()
> 	if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
> 		return -EINVAL;

There is no particular reason this code needs to have the mm sem at
that point.

It should call pin_user_pages_fast() and only if that fails get the mmap
lock and extract the VMA to do broken hackery.

Jason
Daniel Vetter Oct. 4, 2020, 11:20 a.m. UTC | #10
On Sun, Oct 4, 2020 at 1:24 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Sat, Oct 03, 2020 at 03:52:32PM -0700, John Hubbard wrote:
> > On 10/3/20 2:45 AM, Daniel Vetter wrote:
> > > On Sat, Oct 3, 2020 at 12:39 AM John Hubbard <jhubbard@nvidia.com> wrote:
> > > >
> > > > On 10/2/20 10:53 AM, Daniel Vetter wrote:
> > > > > For $reasons I've stumbled over this code and I'm not sure the change
> > > > > to the new gup functions in 55a650c35fea ("mm/gup: frame_vector:
> > > > > convert get_user_pages() --> pin_user_pages()") was entirely correct.
> > > > >
> > > > > This here is used for long term buffers (not just quick I/O) like
> > > > > RDMA, and John notes this in his patch. But I thought the rule for
> > > > > these is that they need to add FOLL_LONGTERM, which John's patch
> > > > > didn't do.
> > > >
> > > > Yep. The earlier gup --> pup conversion patches were intended to not
> > > > have any noticeable behavior changes, and FOLL_LONGTERM, with it's
> > > > special cases and such, added some risk that I wasn't ready to take
> > > > on yet. Also, FOLL_LONGTERM rules are only *recently* getting firmed
> > > > up. So there was some doubt at least in my mind, about which sites
> > > > should have it.
> > > >
> > > > But now that we're here, I think it's really good that you've brought
> > > > this up. It's definitely time to add FOLL_LONGTERM wherever it's missing.
> > >
> > > So should I keep this patch, or will it collide with a series you're working on?
> >
> > It doesn't collide with anything on my end yet, because I've been slow to
> > pick up on the need for changing callsites to add FOLL_LONGTERM. :)
> >
> > And it looks like that's actually a problem, because:
> >
> > >
> > > Also with the firmed up rules, correct that I can also drop the
> > > vma_is_fsdax check when the FOLL_LONGTERM flag is set?
> >
> > That's the right direction to go *in general*, but I see that the
> > pin_user_pages code is still a bit stuck in the past. And this patch
> > won't actually work, with or without that vma_is_fsdax() check.
> > Because:
> >
> > get_vaddr_frames(FOLL_LONGTERM)
> >    pin_user_pages_locked()
> >       if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
> >               return -EINVAL;
>
> There is no particular reason this code needs to have the mm sem at
> that point.
>
> It should call pin_user_pages_fast() and only if that fails get the mmap
> lock and extract the VMA to do broken hackery.

Yeah I think that works. I tried understanding gup.c code a bit more,
and it looks like FOLL_LONGTERM only works for the pup_fast variant
right now? All others seem to have this comment that it's somehow
incompatible with FAULT_FLAG_ALLOW_RETRY and daxfs. But grepping
around for that didn't show up anything, at least not nearby dax code.
For my understanding of all this, what's the hiccup there?

For plans, I only started this for a bit of my own learning, but I
think I'll respin with the following changes:
- convert exynos and habanalabs to pin_user_pages_fast directly,
instead of going through this frame-vector detour
- move the locking and convert get_vaddr_frames to pup_fast as Jason suggested
- hack up some truly gross rfc to plug the follow_pfn hole

Cheers, Daniel

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Jason Gunthorpe Oct. 4, 2020, 12:50 p.m. UTC | #11
On Sat, Oct 03, 2020 at 11:40:22AM +0200, Daniel Vetter wrote:

> > That leaves the only interesting places as vb2_dc_get_userptr() and
> > vb2_vmalloc_get_userptr() which both completely fail to follow the
> > REQUIRED behavior in the function's comment about checking PTEs. It
> > just DMA maps them. Badly broken.
> >
> > Guessing this hackery is for some embedded P2P DMA transfer?
> 
> Yeah, see also the follow_pfn trickery in
> videobuf_dma_contig_user_get(), I think this is fully intentional and
> userspace abi we can't break :-/

We don't need to break uABI, it just needs to work properly in the
kernel:

  vma = find_vma_intersection()
  dma_buf = dma_buf_get_from_vma(vma)
  sg = dma_buf_p2p_dma_map(dma_buf)
  [.. do dma ..]
  dma_buf_unmap(sg)
  dma_buf_put(dma_buf)

It is as we discussed before, dma buf needs to be discoverable from a
VMA, at least for users doing this kind of stuff.

> Yup this should be done with dma_buf instead, and v4l has that. But
> old uapi and all that. This is why I said we might need a new
> VM_DYNAMIC_PFNMAP or so, to make follow_pfn not resolve this in the
> case where the driver manages the underlying iomem range (or whatever
> it is) dynamically and moves buffer objects around, like drm drivers
> do. But I looked, and we've run out of vma->vm_flags :-(

A VM flag doesn't help - we need to introduce some kind of lifetime,
and that has to be derived from the VMA. It needs data not just a flag

> The other problem is that I also have no real working clue about all
> the VM_* flags and what they all mean, and whether drm drivers set the
> right ones in all cases (they probably don't, but oh well).
> Documentation for this stuff in headers is a bit thin at times.

Yah, I don't really know either :\

The comment above vm_normal_page() is a bit helpful. Don't know what
VM_IO/VM_PFNMAP mean in their 3 combinations

There are very few places that set VM_PFNMAP without VM_IO..

Jason
Daniel Vetter Oct. 4, 2020, 4:09 p.m. UTC | #12
On Sun, Oct 4, 2020 at 2:51 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Sat, Oct 03, 2020 at 11:40:22AM +0200, Daniel Vetter wrote:
>
> > > That leaves the only interesting places as vb2_dc_get_userptr() and
> > > vb2_vmalloc_get_userptr() which both completely fail to follow the
> > > REQUIRED behavior in the function's comment about checking PTEs. It
> > > just DMA maps them. Badly broken.
> > >
> > > Guessing this hackery is for some embedded P2P DMA transfer?
> >
> > Yeah, see also the follow_pfn trickery in
> > videobuf_dma_contig_user_get(), I think this is fully intentional and
> > userspace abi we can't break :-/
>
> We don't need to break uABI, it just needs to work properly in the
> kernel:
>
>   vma = find_vma_intersection()
>   dma_buf = dma_buf_get_from_vma(vma)
>   sg = dma_buf_p2p_dma_map(dma_buf)
>   [.. do dma ..]
>   dma_buf_unmap(sg)
>   dma_buf_put(dma_buf)
>
> It is as we discussed before, dma buf needs to be discoverable from a
> VMA, at least for users doing this kind of stuff.

I'm not a big fan of magic behaviour like this, there's more to
dma-buf buffer sharing than just "how do I get at the backing
storage". Thus far we've done everything rather explicitly. Plus with
exynos and habanalabs converted there's only v4l left over, and that
has a proper dma-buf import path already.

> > Yup this should be done with dma_buf instead, and v4l has that. But
> > old uapi and all that. This is why I said we might need a new
> > VM_DYNAMIC_PFNMAP or so, to make follow_pfn not resolve this in the
> > case where the driver manages the underlying iomem range (or whatever
> > it is) dynamically and moves buffer objects around, like drm drivers
> > do. But I looked, and we've run out of vma->vm_flags :-(
>
> A VM flag doesn't help - we need to introduce some kind of lifetime,
> and that has to be derived from the VMA. It needs data not just a flag

I don't want to make it work, I just want to make it fail. Rough idea
I have in mind is to add a follow_pfn_longterm, for all callers which
aren't either synchronized through mmap_sem or an mmu_notifier. If
this really breaks anyone's use-case we can add a tainting kernel
option which re-enables this (we've done something similar for
phys_addr_t based buffer sharing in fbdev, entirely unfixable since
the other driver has to just blindly trust that what userspace passes
around is legit). This here isn't unfixable, but if v4l people want to
keep it without a big "security hole here" sticker, they should do the
work, not me :-)

> > The other problem is that I also have no real working clue about all
> > the VM_* flags and what they all mean, and whether drm drivers set the
> > right ones in all cases (they probably don't, but oh well).
> > Documentation for this stuff in headers is a bit thin at times.
>
> Yah, I don't really know either :\
>
> The comment above vm_normal_page() is a bit helpful. Don't know what
> VM_IO/VM_PFNMAP mean in their 3 combinations
>
> There are very few places that set VM_PFNMAP without VM_IO..

Best I could find is:
- mk68 seems to outright reject pagefaults on VM_IO vma
- some places set VM_IO together with VM_MIXEDMAP instead of
VM_PFNMAP. There's some comments that this makes cow possible, but I
guess that's for the old pfn remap vma (remap_file_pages, which is
removed now). But really no clue.

VM_IO | VM_MIXEDMAP kinda makes me wonder whether follow_pfn gets the
page refcounting all right or horribly wrong in some cases ...
-Daniel
Jan Kara Oct. 5, 2020, 3:03 p.m. UTC | #13
Hi!

On Fri 02-10-20 15:06:03, Jason Gunthorpe wrote:
> This get_vaddr_frames() thing looks impossible to use properly. How on
> earth does a driver guarentee
> 
>  "If @start belongs to VM_IO | VM_PFNMAP vma, we don't touch page
>  structures and the caller must make sure pfns aren't reused for
>  anything else while he is using them."
> 
> The only possible way to do that is if the driver restricts the VMAs
> to ones it owns and interacts with the vm_private data to refcount
> something.
> 
> Since every driver does this wrong anything that uses this is creating
> terrifying security issues.
> 
> IMHO this whole API should be deleted :(

So I'm the one guilty for introducing this API. The API was created to
factor out code in several (mostly V4L AFAIR) drivers thus reducing amount
of drivers poking into MM internals and getting things wrong in various
cases. It may well be that the API is still broken from "can driver ensure
this" POV - I tried to keep things things as they were before in this
regard as I have very little knowledge in how these drivers are supposed to
work.

Anyway, if you can make this go away, sure go ahead :)

								Honza
Jason Gunthorpe Oct. 5, 2020, 5:28 p.m. UTC | #14
On Sun, Oct 04, 2020 at 06:09:29PM +0200, Daniel Vetter wrote:
> On Sun, Oct 4, 2020 at 2:51 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Sat, Oct 03, 2020 at 11:40:22AM +0200, Daniel Vetter wrote:
> >
> > > > That leaves the only interesting places as vb2_dc_get_userptr() and
> > > > vb2_vmalloc_get_userptr() which both completely fail to follow the
> > > > REQUIRED behavior in the function's comment about checking PTEs. It
> > > > just DMA maps them. Badly broken.
> > > >
> > > > Guessing this hackery is for some embedded P2P DMA transfer?
> > >
> > > Yeah, see also the follow_pfn trickery in
> > > videobuf_dma_contig_user_get(), I think this is fully intentional and
> > > userspace abi we can't break :-/
> >
> > We don't need to break uABI, it just needs to work properly in the
> > kernel:
> >
> >   vma = find_vma_intersection()
> >   dma_buf = dma_buf_get_from_vma(vma)
> >   sg = dma_buf_p2p_dma_map(dma_buf)
> >   [.. do dma ..]
> >   dma_buf_unmap(sg)
> >   dma_buf_put(dma_buf)
> >
> > It is as we discussed before, dma buf needs to be discoverable from a
> > VMA, at least for users doing this kind of stuff.
> 
> I'm not a big fan of magic behaviour like this, there's more to
> dma-buf buffer sharing than just "how do I get at the backing
> storage". Thus far we've done everything rather explicitly. Plus with
> exynos and habanalabs converted there's only v4l left over, and that
> has a proper dma-buf import path already.

Well, any VA approach like this has to access some backing refcount
via the VMA. Not really any way to avoid something like that

> > A VM flag doesn't help - we need to introduce some kind of lifetime,
> > and that has to be derived from the VMA. It needs data not just a flag
> 
> I don't want to make it work, I just want to make it fail. Rough idea
> I have in mind is to add a follow_pfn_longterm, for all callers which
> aren't either synchronized through mmap_sem or an mmu_notifier. 

follow_pfn() doesn't work outside the pagetable locks or mmu notifier
protection. Can't be fixed.

We only have a few users:

arch/s390/pci/pci_mmio.c:       ret = follow_pfn(vma, user_addr, pfn);
drivers/media/v4l2-core/videobuf-dma-contig.c:          ret = follow_pfn(vma, user_address, &this_pfn);
drivers/vfio/vfio_iommu_type1.c:        ret = follow_pfn(vma, vaddr, pfn);
drivers/vfio/vfio_iommu_type1.c:                ret = follow_pfn(vma, vaddr, pfn);
mm/frame_vector.c:                      err = follow_pfn(vma, start, &nums[ret]);
virt/kvm/kvm_main.c:    r = follow_pfn(vma, addr, &pfn);
virt/kvm/kvm_main.c:            r = follow_pfn(vma, addr, &pfn);

VFIO is broken like media, but I saw patches fixing the vfio cases
using the VMA and a vfio specific refcount.

media & frame_vector we are talking about here.

kvm is some similar hack added for P2P DMA, see commit
add6a0cd1c5ba51b201e1361b05a5df817083618. It might be protected by notifiers..

s390 looks broken too, needs to hold the page table locks.

So, the answer really is that s390 and media need fixing, and this API
should go away (or become kvm specific)

> If this really breaks anyone's use-case we can add a tainting kernel
> option which re-enables this (we've done something similar for
> phys_addr_t based buffer sharing in fbdev, entirely unfixable since
> the other driver has to just blindly trust that what userspace
> passes around is legit). This here isn't unfixable, but if v4l
> people want to keep it without a big "security hole here" sticker,
> they should do the work, not me :-)

This seems fairly reasonable..

So after frame_vec is purged and we have the one caller in media, move
all this stuff to media and taint the kernel if it goes down the
follow_pfn path

Jason
Jason Gunthorpe Oct. 5, 2020, 5:35 p.m. UTC | #15
On Sun, Oct 04, 2020 at 01:20:31PM +0200, Daniel Vetter wrote:

> Yeah I think that works. I tried understanding gup.c code a bit more,
> and it looks like FOLL_LONGTERM only works for the pup_fast variant
> right now? All others seem to have this comment that it's somehow
> incompatible with FAULT_FLAG_ALLOW_RETRY and daxfs. But grepping
> around for that didn't show up anything, at least not nearby dax code.
> For my understanding of all this, what's the hiccup there?

IIRC it needs the VMA and various other flows can't return the vma
list becuse they unlock the mmap_sem will running

> For plans, I only started this for a bit of my own learning, but I
> think I'll respin with the following changes:
> - convert exynos and habanalabs to pin_user_pages_fast directly,
> instead of going through this frame-vector detour

+1

Jason
Daniel Vetter Oct. 5, 2020, 6:16 p.m. UTC | #16
On Mon, Oct 5, 2020 at 7:28 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Sun, Oct 04, 2020 at 06:09:29PM +0200, Daniel Vetter wrote:
> > On Sun, Oct 4, 2020 at 2:51 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Sat, Oct 03, 2020 at 11:40:22AM +0200, Daniel Vetter wrote:
> > >
> > > > > That leaves the only interesting places as vb2_dc_get_userptr() and
> > > > > vb2_vmalloc_get_userptr() which both completely fail to follow the
> > > > > REQUIRED behavior in the function's comment about checking PTEs. It
> > > > > just DMA maps them. Badly broken.
> > > > >
> > > > > Guessing this hackery is for some embedded P2P DMA transfer?
> > > >
> > > > Yeah, see also the follow_pfn trickery in
> > > > videobuf_dma_contig_user_get(), I think this is fully intentional and
> > > > userspace abi we can't break :-/
> > >
> > > We don't need to break uABI, it just needs to work properly in the
> > > kernel:
> > >
> > >   vma = find_vma_intersection()
> > >   dma_buf = dma_buf_get_from_vma(vma)
> > >   sg = dma_buf_p2p_dma_map(dma_buf)
> > >   [.. do dma ..]
> > >   dma_buf_unmap(sg)
> > >   dma_buf_put(dma_buf)
> > >
> > > It is as we discussed before, dma buf needs to be discoverable from a
> > > VMA, at least for users doing this kind of stuff.
> >
> > I'm not a big fan of magic behaviour like this, there's more to
> > dma-buf buffer sharing than just "how do I get at the backing
> > storage". Thus far we've done everything rather explicitly. Plus with
> > exynos and habanalabs converted there's only v4l left over, and that
> > has a proper dma-buf import path already.
>
> Well, any VA approach like this has to access some backing refcount
> via the VMA. Not really any way to avoid something like that
>
> > > A VM flag doesn't help - we need to introduce some kind of lifetime,
> > > and that has to be derived from the VMA. It needs data not just a flag
> >
> > I don't want to make it work, I just want to make it fail. Rough idea
> > I have in mind is to add a follow_pfn_longterm, for all callers which
> > aren't either synchronized through mmap_sem or an mmu_notifier.
>
> follow_pfn() doesn't work outside the pagetable locks or mmu notifier
> protection. Can't be fixed.
>
> We only have a few users:
>
> arch/s390/pci/pci_mmio.c:       ret = follow_pfn(vma, user_addr, pfn);
> drivers/media/v4l2-core/videobuf-dma-contig.c:          ret = follow_pfn(vma, user_address, &this_pfn);
> drivers/vfio/vfio_iommu_type1.c:        ret = follow_pfn(vma, vaddr, pfn);
> drivers/vfio/vfio_iommu_type1.c:                ret = follow_pfn(vma, vaddr, pfn);
> mm/frame_vector.c:                      err = follow_pfn(vma, start, &nums[ret]);
> virt/kvm/kvm_main.c:    r = follow_pfn(vma, addr, &pfn);
> virt/kvm/kvm_main.c:            r = follow_pfn(vma, addr, &pfn);
>
> VFIO is broken like media, but I saw patches fixing the vfio cases
> using the VMA and a vfio specific refcount.
>
> media & frame_vector we are talking about here.
>
> kvm is some similar hack added for P2P DMA, see commit
> add6a0cd1c5ba51b201e1361b05a5df817083618. It might be protected by notifiers..

Yeah my thinking is that kvm (and I think also vfio, also seems to
have mmu notifier nearby) are ok because of the mmu notiifer. Assuming
that one works correctly.

> s390 looks broken too, needs to hold the page table locks.

Hm yeah I guess that looks fairly reasonable to fix too.

> So, the answer really is that s390 and media need fixing, and this API
> should go away (or become kvm specific)

I'm still not clear how you want fo fix this, since your vma->dma_buf
idea is kinda a decade long plan and so just not going to happen:
- v4l used this mostly (afaik the lore at least) for buffer sharing
with v4l itself, and also a bit with fbdev. Neither even has any
dma-buf exporter code as-is.
- like I said, there's no central dma-buf instance, it was fairly
intentionally create as an all-to-all abstraction. Which means you
either have to roll out a vm_ops->gimme_the_dmabuf or, even more work,
refactor all the dma-buf exporters to go through the same things
- even where we have dma-buf, most mmaps of buffer objects aren't a
dma-buf. Those  are only set up when userspace explicitly asks for
one, so we'd also need to change the mmap code of all drivers involved
to make sure the dma-buf is always created when we do any kind of
mmap.

I don't see that as a realistic thing to ever happen, and meanwhile we
can't leave the gap open for a few years.

> > If this really breaks anyone's use-case we can add a tainting kernel
> > option which re-enables this (we've done something similar for
> > phys_addr_t based buffer sharing in fbdev, entirely unfixable since
> > the other driver has to just blindly trust that what userspace
> > passes around is legit). This here isn't unfixable, but if v4l
> > people want to keep it without a big "security hole here" sticker,
> > they should do the work, not me :-)
>
> This seems fairly reasonable..
>
> So after frame_vec is purged and we have the one caller in media, move
> all this stuff to media and taint the kernel if it goes down the
> follow_pfn path

Yeah I think moving frame_vec back to media sounds like a good idea,
it should stop new users like habanalbas/exynos from popping up at
least. It's follow_pfn that freaks me out more.
-Daniel

-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Jason Gunthorpe Oct. 5, 2020, 6:37 p.m. UTC | #17
On Mon, Oct 05, 2020 at 08:16:33PM +0200, Daniel Vetter wrote:

> > kvm is some similar hack added for P2P DMA, see commit
> > add6a0cd1c5ba51b201e1361b05a5df817083618. It might be protected by notifiers..
> 
> Yeah my thinking is that kvm (and I think also vfio, also seems to
> have mmu notifier nearby) are ok because of the mmu notiifer. Assuming
> that one works correctly.

vfio doesn't have a notifier, Alex was looking to add a vfio private
scheme in the vma->private_data:

https://lore.kernel.org/kvm/159017449210.18853.15037950701494323009.stgit@gimli.home/

Guess it never happened.

> > So, the answer really is that s390 and media need fixing, and this API
> > should go away (or become kvm specific)
> 
> I'm still not clear how you want fo fix this, since your vma->dma_buf
> idea is kinda a decade long plan and so just not going to happen:

Well, it doesn't mean we have to change every part of dma_buf to
participate in this. Just the bits media cares about. Or maybe it is
some higher level varient on top of dma_buf.

Or don't use dma_buf for this, add a new object that just provides
refcounts and P2P DMA connection for IO pfn ranges..

Jason
Daniel Vetter Oct. 5, 2020, 6:54 p.m. UTC | #18
On Mon, Oct 5, 2020 at 8:37 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Mon, Oct 05, 2020 at 08:16:33PM +0200, Daniel Vetter wrote:
>
> > > kvm is some similar hack added for P2P DMA, see commit
> > > add6a0cd1c5ba51b201e1361b05a5df817083618. It might be protected by notifiers..
> >
> > Yeah my thinking is that kvm (and I think also vfio, also seems to
> > have mmu notifier nearby) are ok because of the mmu notiifer. Assuming
> > that one works correctly.
>
> vfio doesn't have a notifier, Alex was looking to add a vfio private
> scheme in the vma->private_data:
>
> https://lore.kernel.org/kvm/159017449210.18853.15037950701494323009.stgit@gimli.home/
>
> Guess it never happened.

I was mislead by the mmu notifier in drivers/vfio/vfio.c. But looking
closer, that's only used by some drivers, I guess to make sure their
device pagetables are kept in sync with reality. And not to make sure
the vfio pfn view is kept in sync with reality.

This could get real nasty I think.

> > > So, the answer really is that s390 and media need fixing, and this API
> > > should go away (or become kvm specific)
> >
> > I'm still not clear how you want fo fix this, since your vma->dma_buf
> > idea is kinda a decade long plan and so just not going to happen:
>
> Well, it doesn't mean we have to change every part of dma_buf to
> participate in this. Just the bits media cares about. Or maybe it is
> some higher level varient on top of dma_buf.
>
> Or don't use dma_buf for this, add a new object that just provides
> refcounts and P2P DMA connection for IO pfn ranges..

So good news is, I dug some layers deeper in v4l, and there's only 2
users which do actually handle pfn and don't immediately convert to a
pages array:
- videbuf-dma-contig.c. Luckily videobuf 1 is deprecated since
forever, so I think we might get away with either just breaking this,
or at least tainting kernels and hiding it behind a nasty Kconfig.
This only uses follow_pfn, which we need to keep anyway for vfio in
the unsafe variant :-/
- videbuf2-vmalloc.c Digging through history this was added to support
import of v4l buffers from drivers that needed contig memory. And way
back before CMA, that meant carveout memory not backed by struct page
*. That should now all have struct pages and be managed by CMA (since
videbuf2-dma-contig.c just uses dma_alloc_coherent underneath), so I
think we can just switch to pin_user_pages(FOLL_LONGTERM here too).

iow I think I can outright delete the frame vector stuff.
-Daniel
Daniel Vetter Oct. 5, 2020, 10:43 p.m. UTC | #19
On Mon, Oct 5, 2020 at 8:54 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> On Mon, Oct 5, 2020 at 8:37 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Mon, Oct 05, 2020 at 08:16:33PM +0200, Daniel Vetter wrote:
> >
> > > > kvm is some similar hack added for P2P DMA, see commit
> > > > add6a0cd1c5ba51b201e1361b05a5df817083618. It might be protected by notifiers..
> > >
> > > Yeah my thinking is that kvm (and I think also vfio, also seems to
> > > have mmu notifier nearby) are ok because of the mmu notiifer. Assuming
> > > that one works correctly.
> >
> > vfio doesn't have a notifier, Alex was looking to add a vfio private
> > scheme in the vma->private_data:
> >
> > https://lore.kernel.org/kvm/159017449210.18853.15037950701494323009.stgit@gimli.home/
> >
> > Guess it never happened.
>
> I was mislead by the mmu notifier in drivers/vfio/vfio.c. But looking
> closer, that's only used by some drivers, I guess to make sure their
> device pagetables are kept in sync with reality. And not to make sure
> the vfio pfn view is kept in sync with reality.
>
> This could get real nasty I think.
>
> > > > So, the answer really is that s390 and media need fixing, and this API
> > > > should go away (or become kvm specific)
> > >
> > > I'm still not clear how you want fo fix this, since your vma->dma_buf
> > > idea is kinda a decade long plan and so just not going to happen:
> >
> > Well, it doesn't mean we have to change every part of dma_buf to
> > participate in this. Just the bits media cares about. Or maybe it is
> > some higher level varient on top of dma_buf.
> >
> > Or don't use dma_buf for this, add a new object that just provides
> > refcounts and P2P DMA connection for IO pfn ranges..
>
> So good news is, I dug some layers deeper in v4l, and there's only 2
> users which do actually handle pfn and don't immediately convert to a
> pages array:
> - videbuf-dma-contig.c. Luckily videobuf 1 is deprecated since
> forever, so I think we might get away with either just breaking this,
> or at least tainting kernels and hiding it behind a nasty Kconfig.
> This only uses follow_pfn, which we need to keep anyway for vfio in
> the unsafe variant :-/
> - videbuf2-vmalloc.c Digging through history this was added to support
> import of v4l buffers from drivers that needed contig memory. And way
> back before CMA, that meant carveout memory not backed by struct page
> *. That should now all have struct pages and be managed by CMA (since
> videbuf2-dma-contig.c just uses dma_alloc_coherent underneath), so I
> think we can just switch to pin_user_pages(FOLL_LONGTERM here too).
>
> iow I think I can outright delete the frame vector stuff.

Ok this doesn't work, because dma_mmap always uses a remap_pfn_range,
which is a VM_IO | VM_PFNMAP vma and so even if it's cma backed and
not a carveout, we can't get the pages. Plus trying to move the cma
pages out of cma for FOLL_LONGTERM would be kinda bad when they've
been allocated as a contig block by dma_alloc_coherent :-)

So this idea of switching over to pup only is going to break zerocopy.
I guess I'll need something else for this then.
-Daniel
Jason Gunthorpe Oct. 5, 2020, 11:41 p.m. UTC | #20
On Tue, Oct 06, 2020 at 12:43:31AM +0200, Daniel Vetter wrote:

> > iow I think I can outright delete the frame vector stuff.
> 
> Ok this doesn't work, because dma_mmap always uses a remap_pfn_range,
> which is a VM_IO | VM_PFNMAP vma and so even if it's cma backed and
> not a carveout, we can't get the pages.

If CMA memory has struct pages it probably should be mmap'd with
different flags, and the lifecycle of the CMA memory needs to respect
the struct page refcount?

> Plus trying to move the cma pages out of cma for FOLL_LONGTERM would
> be kinda bad when they've been allocated as a contig block by
> dma_alloc_coherent :-)

Isn't holding a long term reference to a CMA page one of those really
scary use-after-free security issues I've been talking about?

I know nothing about CMA, so can't say too much, sorry

Jason
Daniel Vetter Oct. 6, 2020, 6:23 a.m. UTC | #21
On Tue, Oct 6, 2020 at 1:41 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Oct 06, 2020 at 12:43:31AM +0200, Daniel Vetter wrote:
>
> > > iow I think I can outright delete the frame vector stuff.
> >
> > Ok this doesn't work, because dma_mmap always uses a remap_pfn_range,
> > which is a VM_IO | VM_PFNMAP vma and so even if it's cma backed and
> > not a carveout, we can't get the pages.
>
> If CMA memory has struct pages it probably should be mmap'd with
> different flags, and the lifecycle of the CMA memory needs to respect
> the struct page refcount?

I guess yes and no. The problem is if there's pagecache in the cma
region, pup(FOLL_LONGTERM) needs to first migrate those pages out of
the cma range. Because all normal page allocation in cma regions must
be migratable at all times. But when you use cma as the contig
allocator (mostly with dma_alloc_coherent) and then remap that for
userspace (e.g. dma_mmap_wc), then anyone doing pup or gup should not
try to migrate anything. Also in the past these contig ranges where
generally carveouts without any struct page, changing that would break
too much I guess.

> > Plus trying to move the cma pages out of cma for FOLL_LONGTERM would
> > be kinda bad when they've been allocated as a contig block by
> > dma_alloc_coherent :-)
>
> Isn't holding a long term reference to a CMA page one of those really
> scary use-after-free security issues I've been talking about?
>
> I know nothing about CMA, so can't say too much, sorry

Uh ... yes :-/

This is actually worse than the gpu case I had in mind, where at most
you can sneak access other gpu buffers. With cma you should be able to
get at arbitrary pagecache (well anything that's GFP_MOVEABLE really).
Nice :-(
-Daniel
Jason Gunthorpe Oct. 6, 2020, 12:26 p.m. UTC | #22
On Tue, Oct 06, 2020 at 08:23:23AM +0200, Daniel Vetter wrote:
> On Tue, Oct 6, 2020 at 1:41 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Tue, Oct 06, 2020 at 12:43:31AM +0200, Daniel Vetter wrote:
> >
> > > > iow I think I can outright delete the frame vector stuff.
> > >
> > > Ok this doesn't work, because dma_mmap always uses a remap_pfn_range,
> > > which is a VM_IO | VM_PFNMAP vma and so even if it's cma backed and
> > > not a carveout, we can't get the pages.
> >
> > If CMA memory has struct pages it probably should be mmap'd with
> > different flags, and the lifecycle of the CMA memory needs to respect
> > the struct page refcount?
> 
> I guess yes and no. The problem is if there's pagecache in the cma
> region, pup(FOLL_LONGTERM) needs to first migrate those pages out of
> the cma range. Because all normal page allocation in cma regions must
> be migratable at all times.

Eh? Then how are we doing follow_pfn() on this stuff and not being
completely broken?

The entire point of this framevec API is to pin the memory and
preventing it from moving around. 

Sounds like it is fundamentally incompatible with CMA. Why is
something trying to mix the two?

> This is actually worse than the gpu case I had in mind, where at most
> you can sneak access other gpu buffers. With cma you should be able to
> get at arbitrary pagecache (well anything that's GFP_MOVEABLE really).
> Nice :-(

Ah, we have a winner :\

Jason
Daniel Vetter Oct. 6, 2020, 1:08 p.m. UTC | #23
On Tue, Oct 6, 2020 at 2:26 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Oct 06, 2020 at 08:23:23AM +0200, Daniel Vetter wrote:
> > On Tue, Oct 6, 2020 at 1:41 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Tue, Oct 06, 2020 at 12:43:31AM +0200, Daniel Vetter wrote:
> > >
> > > > > iow I think I can outright delete the frame vector stuff.
> > > >
> > > > Ok this doesn't work, because dma_mmap always uses a remap_pfn_range,
> > > > which is a VM_IO | VM_PFNMAP vma and so even if it's cma backed and
> > > > not a carveout, we can't get the pages.
> > >
> > > If CMA memory has struct pages it probably should be mmap'd with
> > > different flags, and the lifecycle of the CMA memory needs to respect
> > > the struct page refcount?
> >
> > I guess yes and no. The problem is if there's pagecache in the cma
> > region, pup(FOLL_LONGTERM) needs to first migrate those pages out of
> > the cma range. Because all normal page allocation in cma regions must
> > be migratable at all times.
>
> Eh? Then how are we doing follow_pfn() on this stuff and not being
> completely broken?
>
> The entire point of this framevec API is to pin the memory and
> preventing it from moving around.
>
> Sounds like it is fundamentally incompatible with CMA. Why is
> something trying to mix the two?

I think the assumption way back when this started is that any VM_IO |
VM_PFNMAP vma is perma-pinned because it's just a piece of carveout.
Of course this ignored that it could also be a piece of iomem and
peer2peer dma doens't Just Work, so could result in all kinds of
hilarity and hw exceptions. But no leaks. Well, if you assume that the
ownership of a device never changes after you've booted the system.

But now we have dynamic gpu memory management, a bunch of subsystems
that fully support revoke semantics (in a subsystem specific way), and
CMA trying really hard to make the old carveouts useable for the
system at large when the memory isn't needed by the device. So all
these assumptions behind follow_pfn are out of the window, and
follow_pfn is pretty much broken.

What's worse I noticed that even for static pfnmaps (for userspace
drivers) we now revoke access to those mmaps. For example implemented
for /dev/mem in 3234ac664a87 ("/dev/mem: Revoke mappings when a driver
claims the region"). Which means follow_pfn isn't even working
correctly anymore for that case, and it's all pretty much broken.

> > This is actually worse than the gpu case I had in mind, where at most
> > you can sneak access other gpu buffers. With cma you should be able to
> > get at arbitrary pagecache (well anything that's GFP_MOVEABLE really).
> > Nice :-(
>
> Ah, we have a winner :\

Cheers, Daniel
Marek Szyprowski Oct. 7, 2020, 10:47 a.m. UTC | #24
Hi Daniel,

On 03.10.2020 11:40, Daniel Vetter wrote:
>> After he three places above should use pin_user_pages_fast(), then
>> this whole broken API should be moved into videobuf2-memops.c and a
>> big fat "THIS DOESN'T WORK" stuck on it.
>>
>> videobuf2 should probably use P2P DMA buf for this instead.
> Yup this should be done with dma_buf instead, and v4l has that.

Yes, V4L2 has dma_buf support NOW. That days, using so called V4L2 
USERPTR method was the only way to achieve zero copy buffer sharing 
between devices, so this is just a historical baggage. I've been 
actively involved in implementing that. I've tried to make it secure as 
much as possible assuming the limitation of that approach. With a few 
assumptions it works fine. Buffers are refcounted both by the 
vm_ops->open or by incrementing the refcount of the vm->file. This 
basically works with any sane driver, which doesn't free the mmaped 
buffer until the file is released. This is true for V4L2 and FBdev devices.

This API is considered as deprecated in V4L2 world, so I think 
supporting this hack can be removed one day as nowadays userspace should 
use dma buf.

 > ...

Best regards
Daniel Vetter Oct. 7, 2020, 12:01 p.m. UTC | #25
On Wed, Oct 7, 2020 at 12:47 PM Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
>
> Hi Daniel,
>
> On 03.10.2020 11:40, Daniel Vetter wrote:
> >> After he three places above should use pin_user_pages_fast(), then
> >> this whole broken API should be moved into videobuf2-memops.c and a
> >> big fat "THIS DOESN'T WORK" stuck on it.
> >>
> >> videobuf2 should probably use P2P DMA buf for this instead.
> > Yup this should be done with dma_buf instead, and v4l has that.
>
> Yes, V4L2 has dma_buf support NOW. That days, using so called V4L2
> USERPTR method was the only way to achieve zero copy buffer sharing
> between devices, so this is just a historical baggage. I've been
> actively involved in implementing that. I've tried to make it secure as
> much as possible assuming the limitation of that approach. With a few
> assumptions it works fine. Buffers are refcounted both by the
> vm_ops->open or by incrementing the refcount of the vm->file. This
> basically works with any sane driver, which doesn't free the mmaped
> buffer until the file is released. This is true for V4L2 and FBdev devices.

I'm not seeing any of that vm->file refcounting going on, so not
seeing anything that prevents the mmap area from being removed. Can
you pls give me some pointers about which code you're thinking of?
-Daniel

> This API is considered as deprecated in V4L2 world, so I think
> supporting this hack can be removed one day as nowadays userspace should
> use dma buf.
>
>  > ...
>
> Best regards
> --
> Marek Szyprowski, PhD
> Samsung R&D Institute Poland
>
Marek Szyprowski Oct. 7, 2020, 12:33 p.m. UTC | #26
Hi Daniel,

On 07.10.2020 14:01, Daniel Vetter wrote:
> On Wed, Oct 7, 2020 at 12:47 PM Marek Szyprowski
> <m.szyprowski@samsung.com> wrote:
>> On 03.10.2020 11:40, Daniel Vetter wrote:
>>>> After he three places above should use pin_user_pages_fast(), then
>>>> this whole broken API should be moved into videobuf2-memops.c and a
>>>> big fat "THIS DOESN'T WORK" stuck on it.
>>>>
>>>> videobuf2 should probably use P2P DMA buf for this instead.
>>> Yup this should be done with dma_buf instead, and v4l has that.
>> Yes, V4L2 has dma_buf support NOW. That days, using so called V4L2
>> USERPTR method was the only way to achieve zero copy buffer sharing
>> between devices, so this is just a historical baggage. I've been
>> actively involved in implementing that. I've tried to make it secure as
>> much as possible assuming the limitation of that approach. With a few
>> assumptions it works fine. Buffers are refcounted both by the
>> vm_ops->open or by incrementing the refcount of the vm->file. This
>> basically works with any sane driver, which doesn't free the mmaped
>> buffer until the file is released. This is true for V4L2 and FBdev devices.
> I'm not seeing any of that vm->file refcounting going on, so not
> seeing anything that prevents the mmap area from being removed. Can
> you pls give me some pointers about which code you're thinking of?

Well, it was in vb2_get_vma() function, but now I see that it has been 
lost in fb639eb39154 and 6690c8c78c74 some time ago...

Best regards
Jason Gunthorpe Oct. 7, 2020, 12:44 p.m. UTC | #27
On Wed, Oct 07, 2020 at 02:33:56PM +0200, Marek Szyprowski wrote:
> Well, it was in vb2_get_vma() function, but now I see that it has been 
> lost in fb639eb39154 and 6690c8c78c74 some time ago...

There is no guarentee that holding a get on the file says anthing
about the VMA. This needed to check that the file was some special
kind of file that promised the VMA layout and file lifetime are
connected.

Also, cloning a VMA outside the mm world is just really bad. That
would screw up many assumptions the drivers make.

If it is all obsolete I say we hide it behind a default n config
symbol and taint the kernel if anything uses it.

Add a big comment above the follow_pfn to warn others away from this
code.

Jason
Tomasz Figa Oct. 7, 2020, 12:47 p.m. UTC | #28
On Wed, Oct 7, 2020 at 2:44 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Oct 07, 2020 at 02:33:56PM +0200, Marek Szyprowski wrote:
> > Well, it was in vb2_get_vma() function, but now I see that it has been
> > lost in fb639eb39154 and 6690c8c78c74 some time ago...
>
> There is no guarentee that holding a get on the file says anthing
> about the VMA. This needed to check that the file was some special
> kind of file that promised the VMA layout and file lifetime are
> connected.
>
> Also, cloning a VMA outside the mm world is just really bad. That
> would screw up many assumptions the drivers make.
>
> If it is all obsolete I say we hide it behind a default n config
> symbol and taint the kernel if anything uses it.
>
> Add a big comment above the follow_pfn to warn others away from this
> code.

Sadly it's just verbally declared as deprecated and not formally noted
anyway. There are a lot of userspace applications relying on user
pointer support.
Daniel Vetter Oct. 7, 2020, 12:58 p.m. UTC | #29
On Wed, Oct 7, 2020 at 2:48 PM Tomasz Figa <tfiga@chromium.org> wrote:
>
> On Wed, Oct 7, 2020 at 2:44 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Wed, Oct 07, 2020 at 02:33:56PM +0200, Marek Szyprowski wrote:
> > > Well, it was in vb2_get_vma() function, but now I see that it has been
> > > lost in fb639eb39154 and 6690c8c78c74 some time ago...
> >
> > There is no guarentee that holding a get on the file says anthing
> > about the VMA. This needed to check that the file was some special
> > kind of file that promised the VMA layout and file lifetime are
> > connected.
> >
> > Also, cloning a VMA outside the mm world is just really bad. That
> > would screw up many assumptions the drivers make.
> >
> > If it is all obsolete I say we hide it behind a default n config
> > symbol and taint the kernel if anything uses it.
> >
> > Add a big comment above the follow_pfn to warn others away from this
> > code.
>
> Sadly it's just verbally declared as deprecated and not formally noted
> anyway. There are a lot of userspace applications relying on user
> pointer support.

userptr can stay, it's the userptr abuse for zerocpy buffer sharing
which doesn't work anymore. At least without major surgery (you'd need
an mmu notifier to zap mappings and recreate them, and that pretty
much breaks the v4l model of preallocating all buffers to make sure we
never underflow the buffer queue). And static mappings are not coming
back I think, we'll go ever more into the direction of dynamic
mappings and moving stuff around as needed.
-Daniel
Jason Gunthorpe Oct. 7, 2020, 1:06 p.m. UTC | #30
On Wed, Oct 07, 2020 at 02:58:33PM +0200, Daniel Vetter wrote:
> On Wed, Oct 7, 2020 at 2:48 PM Tomasz Figa <tfiga@chromium.org> wrote:
> >
> > On Wed, Oct 7, 2020 at 2:44 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Wed, Oct 07, 2020 at 02:33:56PM +0200, Marek Szyprowski wrote:
> > > > Well, it was in vb2_get_vma() function, but now I see that it has been
> > > > lost in fb639eb39154 and 6690c8c78c74 some time ago...
> > >
> > > There is no guarentee that holding a get on the file says anthing
> > > about the VMA. This needed to check that the file was some special
> > > kind of file that promised the VMA layout and file lifetime are
> > > connected.
> > >
> > > Also, cloning a VMA outside the mm world is just really bad. That
> > > would screw up many assumptions the drivers make.
> > >
> > > If it is all obsolete I say we hide it behind a default n config
> > > symbol and taint the kernel if anything uses it.
> > >
> > > Add a big comment above the follow_pfn to warn others away from this
> > > code.
> >
> > Sadly it's just verbally declared as deprecated and not formally noted
> > anyway. There are a lot of userspace applications relying on user
> > pointer support.
> 
> userptr can stay, it's the userptr abuse for zerocpy buffer sharing
> which doesn't work anymore. At least without major surgery (you'd need
> an mmu notifier to zap mappings and recreate them, and that pretty
> much breaks the v4l model of preallocating all buffers to make sure we
> never underflow the buffer queue). And static mappings are not coming
> back I think, we'll go ever more into the direction of dynamic
> mappings and moving stuff around as needed.

Right, and to be clear, the last time I saw a security flaw of this
magnitude from a subsystem badly mis-designing itself, Linus's
knee-jerk reaction was to propose to remove the whole subsystem.

Please don't take status-quo as acceptable, V4L community has to work
to resolve this, uABI breakage or not. The follow_pfn related code
must be compiled out of normal distro kernel builds.

Jason
Tomasz Figa Oct. 7, 2020, 1:06 p.m. UTC | #31
On Sat, Oct 3, 2020 at 1:31 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Oct 02, 2020 at 08:16:48PM +0200, Daniel Vetter wrote:
> > On Fri, Oct 2, 2020 at 8:06 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > On Fri, Oct 02, 2020 at 07:53:03PM +0200, Daniel Vetter wrote:
> > > > For $reasons I've stumbled over this code and I'm not sure the change
> > > > to the new gup functions in 55a650c35fea ("mm/gup: frame_vector:
> > > > convert get_user_pages() --> pin_user_pages()") was entirely correct.
> > > >
> > > > This here is used for long term buffers (not just quick I/O) like
> > > > RDMA, and John notes this in his patch. But I thought the rule for
> > > > these is that they need to add FOLL_LONGTERM, which John's patch
> > > > didn't do.
> > > >
> > > > There is already a dax specific check (added in b7f0554a56f2 ("mm:
> > > > fail get_vaddr_frames() for filesystem-dax mappings")), so this seems
> > > > like the prudent thing to do.
> > > >
> > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > Cc: John Hubbard <jhubbard@nvidia.com>
> > > > Cc: Jérôme Glisse <jglisse@redhat.com>
> > > > Cc: Jan Kara <jack@suse.cz>
> > > > Cc: Dan Williams <dan.j.williams@intel.com>
> > > > Cc: linux-mm@kvack.org
> > > > Cc: linux-arm-kernel@lists.infradead.org
> > > > Cc: linux-samsung-soc@vger.kernel.org
> > > > Cc: linux-media@vger.kernel.org
> > > > Hi all,
> > > >
> > > > I stumbled over this and figured typing this patch can't hurt. Really
> > > > just to maybe learn a few things about how gup/pup is supposed to be
> > > > used (we have a bit of that in drivers/gpu), this here isn't really
> > > > ralated to anything I'm doing.
> > >
> > > FOLL_FORCE is a pretty big clue it should be FOLL_LONGTERM, IMHO
> >
> > Since you're here ... I've noticed that ib sets FOLL_FORCE when the ib
> > verb access mode indicates possible writes. I'm not really clear on
> > why FOLL_WRITE isn't enough any why you need to be able to write
> > through a vma that's write protected currently.
>
> Ah, FOLL_FORCE | FOLL_WRITE means *read* confusingly enough, and the
> only reason you'd want this version for read is if you are doing
> longterm stuff. I wrote about this recently:
>
> https://lore.kernel.org/linux-mm/20200928235739.GU9916@ziepe.ca/
>
> > > Since every driver does this wrong anything that uses this is creating
> > > terrifying security issues.
> > >
> > > IMHO this whole API should be deleted :(
> >
> > Yeah that part I just tried to conveniently ignore. I guess this dates
> > back to a time when ioremaps where at best fixed, and there wasn't
> > anything like a gpu driver dynamically managing vram around, resulting
> > in random entirely unrelated things possibly being mapped to that set
> > of pfns.
>
> No, it was always wrong. Prior to GPU like cases the lifetime of the
> PTE was tied to the vma and when the vma becomes free the driver can
> move the things in the PTEs to 'free'. Easy to trigger use-after-free
> issues and devices like RDMA have security contexts attached to these
> PTEs so it becomes a serious security bug to do something like this.
>
> > The underlying follow_pfn is also used in other places within
> > drivers/media, so this doesn't seem to be an accident, but actually
> > intentional.
>
> Looking closely, there are very few users, most *seem* pointless, but
> maybe there is a crazy reason?
>
> The sequence
>   get_vaddr_frames();
>   frame_vector_to_pages();
>   sg_alloc_table_from_pages();
>
> Should be written
>   pin_user_pages_fast(FOLL_LONGTERM);
>   sg_alloc_table_from_pages()
>
> There is some 'special' code in frame_vector_to_pages() that tries to
> get a struct page for things from a VM_IO or VM_PFNMAP...
>
> Oh snap, that is *completely* broken! If the first VMA is IO|PFNMAP
> then get_vaddr_frames() iterates over all VMAs in the range, of any
> kind and extracts the PTEs then blindly references them! This means it
> can be used to use after free normal RAM struct pages!! Gah!
>
> Wow. Okay. That has to go.
>
> So, I *think* we can assume there is no sane cases where
> frame_vector_to_pages() succeeds but pin_user_pages() wasn't called.
>
> That means the users here:
>  - habanalabs:  Hey Oded can you fix this up?
>
>  - gpu/exynos: Daniel can you get someone there to stop using it?
>
>  - media/videobuf via vb2_dma_sg_get_userptr()
>
> Should all be switched to the standard pin_user_pages sequence
> above.
>
> That leaves the only interesting places as vb2_dc_get_userptr() and
> vb2_vmalloc_get_userptr() which both completely fail to follow the
> REQUIRED behavior in the function's comment about checking PTEs. It
> just DMA maps them. Badly broken.

Note that vb2_vmalloc is only used for in-kernel CPU usage, e.g. the
contents being copied by the driver between vb2 buffers and some
hardware FIFO or other dedicated buffers. The memory does not go to
any hardware DMA.

Could you elaborate on what "the REQUIRED behavior is"? I can see that
both follow the get_vaddr_frames() -> frame_vector_to_pages() flow, as
you mentioned. Perhaps the only change needed is switching to
pin_user_pages after all?

>
> Guessing this hackery is for some embedded P2P DMA transfer?
>

That's not the case. There are two usage scenarios:

1) Regular user pointer support, i.e. using malloc()ed memory as a
video input or output buffer. This is deemed to be deprecated, but
still used quite widely by various applications. Quite common for
video decoder input and video encoder output, because the other side
of the pipeline is CPU, e.g. network stack.

2) Sharing carveout buffers (without struct pages attached) between
two devices. I believe this predates DMA-buf and CMA (which enabled
carveouts with struct pages) and I'm not sure how commonly it is used
these days.

> After he three places above should use pin_user_pages_fast(), then
> this whole broken API should be moved into videobuf2-memops.c and a
> big fat "THIS DOESN'T WORK" stuck on it.
>
> videobuf2 should probably use P2P DMA buf for this instead.

See above.

Best regards,
Tomasz
Jason Gunthorpe Oct. 7, 2020, 1:14 p.m. UTC | #32
On Wed, Oct 07, 2020 at 03:06:17PM +0200, Tomasz Figa wrote:

> Note that vb2_vmalloc is only used for in-kernel CPU usage, e.g. the
> contents being copied by the driver between vb2 buffers and some
> hardware FIFO or other dedicated buffers. The memory does not go to
> any hardware DMA.

That is even worse, the CPU can't just blindly touch VM_IO pages, that
isn't portable.

> Could you elaborate on what "the REQUIRED behavior is"? I can see that
> both follow the get_vaddr_frames() -> frame_vector_to_pages() flow, as
> you mentioned. Perhaps the only change needed is switching to
> pin_user_pages after all?

It is the comment right on top of get_vaddr_frames():

  if @start belongs to VM_IO | VM_PFNMAP vma, we don't
  touch page structures and the caller must make sure pfns aren't
  reused for anything else while he is using them.

Which means excluding every kind of VMA that is not something this
driver understands and then using special knowledge of the
driver-specific VMA to assure the above.

For instance if you could detect the VMA is from a carevout and do
something special like hold the fget() while knowning that the struct
file guarentees the carveout remains reserved - then you could use
follow_pfn.

But it would be faster and better to ask the carveout FD for the vaddr
range.

Jason
Tomasz Figa Oct. 7, 2020, 1:34 p.m. UTC | #33
On Wed, Oct 7, 2020 at 3:06 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Oct 07, 2020 at 02:58:33PM +0200, Daniel Vetter wrote:
> > On Wed, Oct 7, 2020 at 2:48 PM Tomasz Figa <tfiga@chromium.org> wrote:
> > >
> > > On Wed, Oct 7, 2020 at 2:44 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > >
> > > > On Wed, Oct 07, 2020 at 02:33:56PM +0200, Marek Szyprowski wrote:
> > > > > Well, it was in vb2_get_vma() function, but now I see that it has been
> > > > > lost in fb639eb39154 and 6690c8c78c74 some time ago...
> > > >
> > > > There is no guarentee that holding a get on the file says anthing
> > > > about the VMA. This needed to check that the file was some special
> > > > kind of file that promised the VMA layout and file lifetime are
> > > > connected.
> > > >
> > > > Also, cloning a VMA outside the mm world is just really bad. That
> > > > would screw up many assumptions the drivers make.
> > > >
> > > > If it is all obsolete I say we hide it behind a default n config
> > > > symbol and taint the kernel if anything uses it.
> > > >
> > > > Add a big comment above the follow_pfn to warn others away from this
> > > > code.
> > >
> > > Sadly it's just verbally declared as deprecated and not formally noted
> > > anyway. There are a lot of userspace applications relying on user
> > > pointer support.
> >
> > userptr can stay, it's the userptr abuse for zerocpy buffer sharing
> > which doesn't work anymore. At least without major surgery (you'd need
> > an mmu notifier to zap mappings and recreate them, and that pretty
> > much breaks the v4l model of preallocating all buffers to make sure we
> > never underflow the buffer queue). And static mappings are not coming
> > back I think, we'll go ever more into the direction of dynamic
> > mappings and moving stuff around as needed.
>
> Right, and to be clear, the last time I saw a security flaw of this
> magnitude from a subsystem badly mis-designing itself, Linus's
> knee-jerk reaction was to propose to remove the whole subsystem.
>
> Please don't take status-quo as acceptable, V4L community has to work
> to resolve this, uABI breakage or not. The follow_pfn related code
> must be compiled out of normal distro kernel builds.

I think the userptr zero-copy hack should be able to go away indeed,
given that we now have CMA that allows having carveouts backed by
struct pages and having the memory represented as DMA-buf normally.

How about the regular userptr use case, though?

The existing code resolves the user pointer into pages by following
the get_vaddr_frames() -> frame_vector_to_pages() ->
sg_alloc_table_from_pages() / vm_map_ram() approach.
get_vaddr_frames() seems to use pin_user_pages() behind the scenes if
the vma is not an IO or a PFNMAP, falling back to follow_pfn()
otherwise.

Is your intention to drop get_vaddr_frames() or we could still keep
using it and if vec->is_pfns is true:
a) if CONFIG_VIDEO_LEGACY_PFN_USERPTR is set, taint the kernel
b) otherwise just undo and fail?

Best regards,
Tomasz
Jason Gunthorpe Oct. 7, 2020, 1:42 p.m. UTC | #34
On Wed, Oct 07, 2020 at 03:34:01PM +0200, Tomasz Figa wrote:

> I think the userptr zero-copy hack should be able to go away indeed,
> given that we now have CMA that allows having carveouts backed by
> struct pages and having the memory represented as DMA-buf normally.

This also needs to figure out how to get references to CMA pages out
of a VMA. IIRC Daniel said these are not pinnable?

> How about the regular userptr use case, though?

Just call pin_user_pages(), that is the easy case.

> Is your intention to drop get_vaddr_frames() or we could still keep
> using it and if vec->is_pfns is true:

get_vaddr_frames() is dangerous, I would like it to go away.

> a) if CONFIG_VIDEO_LEGACY_PFN_USERPTR is set, taint the kernel
> b) otherwise just undo and fail?

For the CONFIG_VIDEO_LEGACY_PFN_USERPTR case all the follow_pfn
related code in get_vaddr_frames() shold move back into media and be
hidden under this config.

Jason
Daniel Vetter Oct. 7, 2020, 2:08 p.m. UTC | #35
On Wed, Oct 7, 2020 at 3:34 PM Tomasz Figa <tfiga@chromium.org> wrote:
>
> On Wed, Oct 7, 2020 at 3:06 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Wed, Oct 07, 2020 at 02:58:33PM +0200, Daniel Vetter wrote:
> > > On Wed, Oct 7, 2020 at 2:48 PM Tomasz Figa <tfiga@chromium.org> wrote:
> > > >
> > > > On Wed, Oct 7, 2020 at 2:44 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > >
> > > > > On Wed, Oct 07, 2020 at 02:33:56PM +0200, Marek Szyprowski wrote:
> > > > > > Well, it was in vb2_get_vma() function, but now I see that it has been
> > > > > > lost in fb639eb39154 and 6690c8c78c74 some time ago...
> > > > >
> > > > > There is no guarentee that holding a get on the file says anthing
> > > > > about the VMA. This needed to check that the file was some special
> > > > > kind of file that promised the VMA layout and file lifetime are
> > > > > connected.
> > > > >
> > > > > Also, cloning a VMA outside the mm world is just really bad. That
> > > > > would screw up many assumptions the drivers make.
> > > > >
> > > > > If it is all obsolete I say we hide it behind a default n config
> > > > > symbol and taint the kernel if anything uses it.
> > > > >
> > > > > Add a big comment above the follow_pfn to warn others away from this
> > > > > code.
> > > >
> > > > Sadly it's just verbally declared as deprecated and not formally noted
> > > > anyway. There are a lot of userspace applications relying on user
> > > > pointer support.
> > >
> > > userptr can stay, it's the userptr abuse for zerocpy buffer sharing
> > > which doesn't work anymore. At least without major surgery (you'd need
> > > an mmu notifier to zap mappings and recreate them, and that pretty
> > > much breaks the v4l model of preallocating all buffers to make sure we
> > > never underflow the buffer queue). And static mappings are not coming
> > > back I think, we'll go ever more into the direction of dynamic
> > > mappings and moving stuff around as needed.
> >
> > Right, and to be clear, the last time I saw a security flaw of this
> > magnitude from a subsystem badly mis-designing itself, Linus's
> > knee-jerk reaction was to propose to remove the whole subsystem.
> >
> > Please don't take status-quo as acceptable, V4L community has to work
> > to resolve this, uABI breakage or not. The follow_pfn related code
> > must be compiled out of normal distro kernel builds.
>
> I think the userptr zero-copy hack should be able to go away indeed,
> given that we now have CMA that allows having carveouts backed by
> struct pages and having the memory represented as DMA-buf normally.

Not sure whether there's a confusion here: dma-buf supports memory not
backed by struct page.

> How about the regular userptr use case, though?
>
> The existing code resolves the user pointer into pages by following
> the get_vaddr_frames() -> frame_vector_to_pages() ->
> sg_alloc_table_from_pages() / vm_map_ram() approach.
> get_vaddr_frames() seems to use pin_user_pages() behind the scenes if
> the vma is not an IO or a PFNMAP, falling back to follow_pfn()
> otherwise.

Yeah pin_user_pages is fine, it's just the VM_IO | VM_PFNMAP vma that
don't work.
>
> Is your intention to drop get_vaddr_frames() or we could still keep
> using it and if vec->is_pfns is true:
> a) if CONFIG_VIDEO_LEGACY_PFN_USERPTR is set, taint the kernel
> b) otherwise just undo and fail?

I'm typing that patch series (plus a pile more) right now.
-Daniel
Tomasz Figa Oct. 7, 2020, 2:11 p.m. UTC | #36
On Wed, Oct 7, 2020 at 4:09 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> On Wed, Oct 7, 2020 at 3:34 PM Tomasz Figa <tfiga@chromium.org> wrote:
> >
> > On Wed, Oct 7, 2020 at 3:06 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Wed, Oct 07, 2020 at 02:58:33PM +0200, Daniel Vetter wrote:
> > > > On Wed, Oct 7, 2020 at 2:48 PM Tomasz Figa <tfiga@chromium.org> wrote:
> > > > >
> > > > > On Wed, Oct 7, 2020 at 2:44 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > > >
> > > > > > On Wed, Oct 07, 2020 at 02:33:56PM +0200, Marek Szyprowski wrote:
> > > > > > > Well, it was in vb2_get_vma() function, but now I see that it has been
> > > > > > > lost in fb639eb39154 and 6690c8c78c74 some time ago...
> > > > > >
> > > > > > There is no guarentee that holding a get on the file says anthing
> > > > > > about the VMA. This needed to check that the file was some special
> > > > > > kind of file that promised the VMA layout and file lifetime are
> > > > > > connected.
> > > > > >
> > > > > > Also, cloning a VMA outside the mm world is just really bad. That
> > > > > > would screw up many assumptions the drivers make.
> > > > > >
> > > > > > If it is all obsolete I say we hide it behind a default n config
> > > > > > symbol and taint the kernel if anything uses it.
> > > > > >
> > > > > > Add a big comment above the follow_pfn to warn others away from this
> > > > > > code.
> > > > >
> > > > > Sadly it's just verbally declared as deprecated and not formally noted
> > > > > anyway. There are a lot of userspace applications relying on user
> > > > > pointer support.
> > > >
> > > > userptr can stay, it's the userptr abuse for zerocpy buffer sharing
> > > > which doesn't work anymore. At least without major surgery (you'd need
> > > > an mmu notifier to zap mappings and recreate them, and that pretty
> > > > much breaks the v4l model of preallocating all buffers to make sure we
> > > > never underflow the buffer queue). And static mappings are not coming
> > > > back I think, we'll go ever more into the direction of dynamic
> > > > mappings and moving stuff around as needed.
> > >
> > > Right, and to be clear, the last time I saw a security flaw of this
> > > magnitude from a subsystem badly mis-designing itself, Linus's
> > > knee-jerk reaction was to propose to remove the whole subsystem.
> > >
> > > Please don't take status-quo as acceptable, V4L community has to work
> > > to resolve this, uABI breakage or not. The follow_pfn related code
> > > must be compiled out of normal distro kernel builds.
> >
> > I think the userptr zero-copy hack should be able to go away indeed,
> > given that we now have CMA that allows having carveouts backed by
> > struct pages and having the memory represented as DMA-buf normally.
>
> Not sure whether there's a confusion here: dma-buf supports memory not
> backed by struct page.
>

That's new to me. The whole API relies on sg_tables a lot, which in
turn rely on struct page pointers to describe the physical memory.

> > How about the regular userptr use case, though?
> >
> > The existing code resolves the user pointer into pages by following
> > the get_vaddr_frames() -> frame_vector_to_pages() ->
> > sg_alloc_table_from_pages() / vm_map_ram() approach.
> > get_vaddr_frames() seems to use pin_user_pages() behind the scenes if
> > the vma is not an IO or a PFNMAP, falling back to follow_pfn()
> > otherwise.
>
> Yeah pin_user_pages is fine, it's just the VM_IO | VM_PFNMAP vma that
> don't work.

Ack.

> >
> > Is your intention to drop get_vaddr_frames() or we could still keep
> > using it and if vec->is_pfns is true:
> > a) if CONFIG_VIDEO_LEGACY_PFN_USERPTR is set, taint the kernel
> > b) otherwise just undo and fail?
>
> I'm typing that patch series (plus a pile more) right now.

Cool, thanks!

We also need to bring back the vma_open() that somehow disappeared
around 4.2, as Marek found.

Best regards,
Tomasz
Daniel Vetter Oct. 7, 2020, 2:22 p.m. UTC | #37
On Wed, Oct 7, 2020 at 4:12 PM Tomasz Figa <tfiga@chromium.org> wrote:
>
> On Wed, Oct 7, 2020 at 4:09 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> >
> > On Wed, Oct 7, 2020 at 3:34 PM Tomasz Figa <tfiga@chromium.org> wrote:
> > >
> > > On Wed, Oct 7, 2020 at 3:06 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > >
> > > > On Wed, Oct 07, 2020 at 02:58:33PM +0200, Daniel Vetter wrote:
> > > > > On Wed, Oct 7, 2020 at 2:48 PM Tomasz Figa <tfiga@chromium.org> wrote:
> > > > > >
> > > > > > On Wed, Oct 7, 2020 at 2:44 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > > > >
> > > > > > > On Wed, Oct 07, 2020 at 02:33:56PM +0200, Marek Szyprowski wrote:
> > > > > > > > Well, it was in vb2_get_vma() function, but now I see that it has been
> > > > > > > > lost in fb639eb39154 and 6690c8c78c74 some time ago...
> > > > > > >
> > > > > > > There is no guarentee that holding a get on the file says anthing
> > > > > > > about the VMA. This needed to check that the file was some special
> > > > > > > kind of file that promised the VMA layout and file lifetime are
> > > > > > > connected.
> > > > > > >
> > > > > > > Also, cloning a VMA outside the mm world is just really bad. That
> > > > > > > would screw up many assumptions the drivers make.
> > > > > > >
> > > > > > > If it is all obsolete I say we hide it behind a default n config
> > > > > > > symbol and taint the kernel if anything uses it.
> > > > > > >
> > > > > > > Add a big comment above the follow_pfn to warn others away from this
> > > > > > > code.
> > > > > >
> > > > > > Sadly it's just verbally declared as deprecated and not formally noted
> > > > > > anyway. There are a lot of userspace applications relying on user
> > > > > > pointer support.
> > > > >
> > > > > userptr can stay, it's the userptr abuse for zerocpy buffer sharing
> > > > > which doesn't work anymore. At least without major surgery (you'd need
> > > > > an mmu notifier to zap mappings and recreate them, and that pretty
> > > > > much breaks the v4l model of preallocating all buffers to make sure we
> > > > > never underflow the buffer queue). And static mappings are not coming
> > > > > back I think, we'll go ever more into the direction of dynamic
> > > > > mappings and moving stuff around as needed.
> > > >
> > > > Right, and to be clear, the last time I saw a security flaw of this
> > > > magnitude from a subsystem badly mis-designing itself, Linus's
> > > > knee-jerk reaction was to propose to remove the whole subsystem.
> > > >
> > > > Please don't take status-quo as acceptable, V4L community has to work
> > > > to resolve this, uABI breakage or not. The follow_pfn related code
> > > > must be compiled out of normal distro kernel builds.
> > >
> > > I think the userptr zero-copy hack should be able to go away indeed,
> > > given that we now have CMA that allows having carveouts backed by
> > > struct pages and having the memory represented as DMA-buf normally.
> >
> > Not sure whether there's a confusion here: dma-buf supports memory not
> > backed by struct page.
> >
>
> That's new to me. The whole API relies on sg_tables a lot, which in
> turn rely on struct page pointers to describe the physical memory.

You're not allowed to look at struct page pointers from the importer
side, those might not be there. Which isn't the prettiest thing, but
it works. And even if there's a struct page, you're still not allowed
to look at it, since it's fully managed by the exporter under whatever
rules that might need. So no touching it, ever.

This is also not news, supporting this was in the design brief from
the kickoff session 10+ years ago at some linaro connect thing (in
Budapest iirc). And we have implementations doing that for almost as
long merged in upstream.

> > > How about the regular userptr use case, though?
> > >
> > > The existing code resolves the user pointer into pages by following
> > > the get_vaddr_frames() -> frame_vector_to_pages() ->
> > > sg_alloc_table_from_pages() / vm_map_ram() approach.
> > > get_vaddr_frames() seems to use pin_user_pages() behind the scenes if
> > > the vma is not an IO or a PFNMAP, falling back to follow_pfn()
> > > otherwise.
> >
> > Yeah pin_user_pages is fine, it's just the VM_IO | VM_PFNMAP vma that
> > don't work.
>
> Ack.
>
> > >
> > > Is your intention to drop get_vaddr_frames() or we could still keep
> > > using it and if vec->is_pfns is true:
> > > a) if CONFIG_VIDEO_LEGACY_PFN_USERPTR is set, taint the kernel
> > > b) otherwise just undo and fail?
> >
> > I'm typing that patch series (plus a pile more) right now.
>
> Cool, thanks!
>
> We also need to bring back the vma_open() that somehow disappeared
> around 4.2, as Marek found.

The vm_open isn't enough to stop the problems (it doesn't and cannot
protect against unmap_mapping_range), I don't think keeping an
incomplete solution around has much benefit. People who need this can
disable CONFIG_STRICT_FOLLOW_PFN to keep things working, everyone else
probably doesn't want these mm internals leaking into the media
subsystem.
-Daniel
Jason Gunthorpe Oct. 7, 2020, 2:58 p.m. UTC | #38
On Wed, Oct 07, 2020 at 04:11:59PM +0200, Tomasz Figa wrote:

> We also need to bring back the vma_open() that somehow disappeared
> around 4.2, as Marek found.

No

Jason
Tomasz Figa Oct. 7, 2020, 3:05 p.m. UTC | #39
On Wed, Oct 7, 2020 at 4:23 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> On Wed, Oct 7, 2020 at 4:12 PM Tomasz Figa <tfiga@chromium.org> wrote:
> >
> > On Wed, Oct 7, 2020 at 4:09 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > >
> > > On Wed, Oct 7, 2020 at 3:34 PM Tomasz Figa <tfiga@chromium.org> wrote:
> > > >
> > > > On Wed, Oct 7, 2020 at 3:06 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > >
> > > > > On Wed, Oct 07, 2020 at 02:58:33PM +0200, Daniel Vetter wrote:
> > > > > > On Wed, Oct 7, 2020 at 2:48 PM Tomasz Figa <tfiga@chromium.org> wrote:
> > > > > > >
> > > > > > > On Wed, Oct 7, 2020 at 2:44 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > > > > >
> > > > > > > > On Wed, Oct 07, 2020 at 02:33:56PM +0200, Marek Szyprowski wrote:
> > > > > > > > > Well, it was in vb2_get_vma() function, but now I see that it has been
> > > > > > > > > lost in fb639eb39154 and 6690c8c78c74 some time ago...
> > > > > > > >
> > > > > > > > There is no guarentee that holding a get on the file says anthing
> > > > > > > > about the VMA. This needed to check that the file was some special
> > > > > > > > kind of file that promised the VMA layout and file lifetime are
> > > > > > > > connected.
> > > > > > > >
> > > > > > > > Also, cloning a VMA outside the mm world is just really bad. That
> > > > > > > > would screw up many assumptions the drivers make.
> > > > > > > >
> > > > > > > > If it is all obsolete I say we hide it behind a default n config
> > > > > > > > symbol and taint the kernel if anything uses it.
> > > > > > > >
> > > > > > > > Add a big comment above the follow_pfn to warn others away from this
> > > > > > > > code.
> > > > > > >
> > > > > > > Sadly it's just verbally declared as deprecated and not formally noted
> > > > > > > anyway. There are a lot of userspace applications relying on user
> > > > > > > pointer support.
> > > > > >
> > > > > > userptr can stay, it's the userptr abuse for zerocpy buffer sharing
> > > > > > which doesn't work anymore. At least without major surgery (you'd need
> > > > > > an mmu notifier to zap mappings and recreate them, and that pretty
> > > > > > much breaks the v4l model of preallocating all buffers to make sure we
> > > > > > never underflow the buffer queue). And static mappings are not coming
> > > > > > back I think, we'll go ever more into the direction of dynamic
> > > > > > mappings and moving stuff around as needed.
> > > > >
> > > > > Right, and to be clear, the last time I saw a security flaw of this
> > > > > magnitude from a subsystem badly mis-designing itself, Linus's
> > > > > knee-jerk reaction was to propose to remove the whole subsystem.
> > > > >
> > > > > Please don't take status-quo as acceptable, V4L community has to work
> > > > > to resolve this, uABI breakage or not. The follow_pfn related code
> > > > > must be compiled out of normal distro kernel builds.
> > > >
> > > > I think the userptr zero-copy hack should be able to go away indeed,
> > > > given that we now have CMA that allows having carveouts backed by
> > > > struct pages and having the memory represented as DMA-buf normally.
> > >
> > > Not sure whether there's a confusion here: dma-buf supports memory not
> > > backed by struct page.
> > >
> >
> > That's new to me. The whole API relies on sg_tables a lot, which in
> > turn rely on struct page pointers to describe the physical memory.
>
> You're not allowed to look at struct page pointers from the importer
> side, those might not be there. Which isn't the prettiest thing, but
> it works. And even if there's a struct page, you're still not allowed
> to look at it, since it's fully managed by the exporter under whatever
> rules that might need. So no touching it, ever.
>
> This is also not news, supporting this was in the design brief from
> the kickoff session 10+ years ago at some linaro connect thing (in
> Budapest iirc). And we have implementations doing that for almost as
> long merged in upstream.
>
> > > > How about the regular userptr use case, though?
> > > >
> > > > The existing code resolves the user pointer into pages by following
> > > > the get_vaddr_frames() -> frame_vector_to_pages() ->
> > > > sg_alloc_table_from_pages() / vm_map_ram() approach.
> > > > get_vaddr_frames() seems to use pin_user_pages() behind the scenes if
> > > > the vma is not an IO or a PFNMAP, falling back to follow_pfn()
> > > > otherwise.
> > >
> > > Yeah pin_user_pages is fine, it's just the VM_IO | VM_PFNMAP vma that
> > > don't work.
> >
> > Ack.
> >
> > > >
> > > > Is your intention to drop get_vaddr_frames() or we could still keep
> > > > using it and if vec->is_pfns is true:
> > > > a) if CONFIG_VIDEO_LEGACY_PFN_USERPTR is set, taint the kernel
> > > > b) otherwise just undo and fail?
> > >
> > > I'm typing that patch series (plus a pile more) right now.
> >
> > Cool, thanks!
> >
> > We also need to bring back the vma_open() that somehow disappeared
> > around 4.2, as Marek found.
>
> The vm_open isn't enough to stop the problems (it doesn't and cannot
> protect against unmap_mapping_range), I don't think keeping an
> incomplete solution around has much benefit. People who need this can
> disable CONFIG_STRICT_FOLLOW_PFN to keep things working, everyone else
> probably doesn't want these mm internals leaking into the media
> subsystem.

Okay, I think I mixed up the two different threads in the earlier
discussion. Since pin_user_pages() would atomically look up and pin
the underlying pages, we should be fine, so that vm_open() was only
needed for the bad follow_pfn path.

Best regards,
Tomasz
diff mbox series

Patch

diff --git a/mm/frame_vector.c b/mm/frame_vector.c
index 5d34c9047e9c..3507e09cb3ff 100644
--- a/mm/frame_vector.c
+++ b/mm/frame_vector.c
@@ -35,7 +35,7 @@  int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
-	unsigned int gup_flags = FOLL_WRITE | FOLL_FORCE;
+	unsigned int gup_flags = FOLL_WRITE | FOLL_FORCE | FOLL_LONGTERM;
 	int ret = 0;
 	int err;
 	int locked;