[07/13] mm: close race in generic_access_phys

Message ID	20201007164426.1812530-8-daniel.vetter@ffwll.ch (mailing list archive)
State	New, archived
Series	follow_pfn and other iomap races
  [00/13] follow_pfn and other iomap races
  [01/13] drm/exynos: Stop using frame_vector helpers
  [02/13] drm/exynos: Use FOLL_LONGTERM for g2d cmdlists
  [03/13] misc/habana: Stop using frame_vector helpers
  [04/13] misc/habana: Use FOLL_LONGTERM for userptr
  [05/13] mm/frame-vector: Use FOLL_LONGTERM
  [06/13] media: videobuf2: Move frame_vector into media subsystem
  [07/13] mm: close race in generic_access_phys
  [08/13] s390/pci: Remove races against pte updates
  [09/13] PCI: obey iomem restrictions for procfs mmap
  [10/13] PCI: revoke mappings like devmem
  [11/13] mm: add unsafe_follow_pfn
  [12/13] media/videbuf1|2: Mark follow_pfn usage as unsafe
  [13/13] vfio/type1: Mark follow_pfn as unsafe

Daniel Vetter Oct. 7, 2020, 4:44 p.m. UTC

Way back it was a reasonable assumption that iomem mappings never
change the pfn range they point at. But this has changed:

- gpu drivers dynamically manage their memory nowadays, invalidating
  ptes with unmap_mapping_range when buffers get moved

- contiguous dma allocations have moved from dedicated carvetouts to
  cma regions. This means if we miss the unmap the pfn might contain
  pagecache or anon memory (well anything allocated with GFP_MOVEABLE)

- even /dev/mem now invalidates mappings when the kernel requests that
  iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87
  ("/dev/mem: Revoke mappings when a driver claims the region")

Accessing pfns obtained from ptes without holding all the locks is
therefore no longer a good idea. Fix this.

Since ioremap might need to manipulate pagetables too we need to drop
the pt lock and have a retry loop if we raced.

While at it, also add kerneldoc and improve the comment for the
vma_ops->access function. It's for accessing, not for moving the
memory from iomem to system memory, as the old comment seemed to
suggest.

References: 28b2ee20c7cb ("access_process_vm device memory infrastructure")
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Benjamin Herrensmidt <benh@kernel.crashing.org>
Cc: Dave Airlie <airlied@linux.ie>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 include/linux/mm.h |  3 ++-
 mm/memory.c        | 44 ++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 44 insertions(+), 3 deletions(-)
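The "drop the pt lock and retry if we raced" scheme from the commit message can be modelled in userspace. What follows is only a sketch under stated assumptions, not the kernel code: `struct fake_pte`, `access_with_retry()` and `move_once()` are hypothetical names, a pthread mutex stands in for the pt lock, and the `setup` callback stands in for the slow `ioremap_prot()` step that must run without the lock held.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for one page-table entry guarded by the pt lock. */
struct fake_pte {
	pthread_mutex_t lock;	/* plays the role of the pt lock */
	uint64_t val;		/* pfn and protection bits; 0 means not present */
};

/* Slow setup step (think ioremap_prot()); called with the lock dropped. */
typedef void (*setup_fn)(struct fake_pte *pte, uint64_t snapshot, void *ctx);

/*
 * Snapshot the entry under the lock, drop the lock for the slow step, then
 * retake the lock and only proceed if the entry is still what we snapshotted.
 * Returns 0 and stores the value the access was based on in *used, or -1 if
 * the entry is not present. *passes counts trips through the retry loop.
 */
static int access_with_retry(struct fake_pte *pte, setup_fn setup, void *ctx,
			     uint64_t *used, int *passes)
{
	uint64_t snap;

	*passes = 0;
retry:
	(*passes)++;
	pthread_mutex_lock(&pte->lock);
	snap = pte->val;
	pthread_mutex_unlock(&pte->lock);
	if (!snap)
		return -1;

	setup(pte, snap, ctx);		/* may sleep, so no lock held here */

	pthread_mutex_lock(&pte->lock);
	if (pte->val != snap) {
		/* Raced with a move/revoke: undo the setup and start over. */
		pthread_mutex_unlock(&pte->lock);
		goto retry;
	}
	*used = snap;			/* the access itself runs under the lock */
	pthread_mutex_unlock(&pte->lock);
	return 0;
}

/* Simulated racer: moves the entry during the first setup call only. */
static void move_once(struct fake_pte *pte, uint64_t snapshot, void *ctx)
{
	int *calls = ctx;

	if ((*calls)++ == 0) {
		pthread_mutex_lock(&pte->lock);
		pte->val = snapshot + 0x1000;
		pthread_mutex_unlock(&pte->lock);
	}
}
```

With `move_once` as the setup callback the first pass is thrown away and the second pass runs against the moved entry, which is the behaviour the retry loop is meant to guarantee.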

Jason Gunthorpe Oct. 7, 2020, 5:27 p.m. UTC | #1

On Wed, Oct 07, 2020 at 06:44:20PM +0200, Daniel Vetter wrote:
> Way back it was a reasonable assumption that iomem mappings never
> change the pfn range they point at. But this has changed:
> 
> - gpu drivers dynamically manage their memory nowadays, invalidating
>   ptes with unmap_mapping_range when buffers get moved
> 
> - contiguous dma allocations have moved from dedicated carvetouts to
>   cma regions. This means if we miss the unmap the pfn might contain
>   pagecache or anon memory (well anything allocated with GFP_MOVEABLE)
> 
> - even /dev/mem now invalidates mappings when the kernel requests that
>   iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87
>   ("/dev/mem: Revoke mappings when a driver claims the region")
> 
> Accessing pfns obtained from ptes without holding all the locks is
> therefore no longer a good idea. Fix this.
> 
> Since ioremap might need to manipulate pagetables too we need to drop
> the pt lock and have a retry loop if we raced.
> 
> While at it, also add kerneldoc and improve the comment for the
> vma_ops->access function. It's for accessing, not for moving the
> memory from iomem to system memory, as the old comment seemed to
> suggest.
> 
> References: 28b2ee20c7cb ("access_process_vm device memory infrastructure")
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Benjamin Herrensmidt <benh@kernel.crashing.org>
> Cc: Dave Airlie <airlied@linux.ie>
> Cc: Hugh Dickins <hugh@veritas.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: linux-mm@kvack.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-samsung-soc@vger.kernel.org
> Cc: linux-media@vger.kernel.org
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  include/linux/mm.h |  3 ++-
>  mm/memory.c        | 44 ++++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 44 insertions(+), 3 deletions(-)

This does seem to solve the race with revoke_devmem(), but it is really ugly.

It would be much nicer to wrap a rwsem around this access and the unmap.

Any place using it has a nice linear translation from vm_off to pfn,
so I don't think there is such a good reason to use follow_pte in
the first place.

ie why not the helper be this:

 int generic_access_phys(unsigned long pfn, unsigned long pgprot,
      void *buf, size_t len, bool write)

Then something like dev/mem would compute pfn and obtain the lock:

static int dev_access(struct vm_area_struct *vma, unsigned long addr,
                      void *buf, int len, int write)
{
     phys_addr_t cpu_addr = ((phys_addr_t)vma->vm_pgoff << PAGE_SHIFT) +
                            (addr - vma->vm_start);
     int ret;

     /* FIXME: Has to be over each page of len */
     if (!devmem_is_allowed_access(PHYS_PFN(cpu_addr)))
           return -EPERM;

     /* Readers hold mem_sem so a revoke (the writer) cannot run mid-access */
     down_read(&mem_sem);
     ret = generic_access_phys(PHYS_PFN(cpu_addr), pgprot_val(vma->vm_page_prot),
                               buf, len, write);
     up_read(&mem_sem);
     return ret;
}

The other cases looked simpler because they don't revoke, here the
mmap_sem alone should be enough protection, they would just need to
provide the linear translation to pfn.

What do you think?

Jason
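The linear translation this proposal relies on, and the per-page check that the FIXME in the sketch asks for, are plain arithmetic. Below is a small userspace illustration, assuming 4 KiB pages and a mapping where `vm_pgoff` is the pfn behind `vm_start` (as it is for /dev/mem style mappings). `struct sketch_vma`, `linear_pfn()`, `range_allowed()` and the `below_claimed_region()` policy are made-up names for this example only.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Only the two vm_area_struct fields the translation needs. */
struct sketch_vma {
	unsigned long vm_start;	/* first user address of the mapping */
	unsigned long vm_pgoff;	/* pfn that vm_start maps to */
};

/* Linear translation: user address -> pfn, no page-table walk involved. */
static unsigned long linear_pfn(const struct sketch_vma *vma, unsigned long addr)
{
	return vma->vm_pgoff + ((addr - vma->vm_start) >> SKETCH_PAGE_SHIFT);
}

typedef bool (*pfn_allowed_fn)(unsigned long pfn);

/* Check every page touched by [addr, addr + len), not just the first one. */
static bool range_allowed(const struct sketch_vma *vma, unsigned long addr,
			  size_t len, pfn_allowed_fn allowed)
{
	unsigned long pfn, last;

	if (len == 0)
		return true;

	last = linear_pfn(vma, addr + len - 1);
	for (pfn = linear_pfn(vma, addr); pfn <= last; pfn++)
		if (!allowed(pfn))
			return false;
	return true;
}

/* Example policy: pretend a driver has claimed everything from pfn 0x100 up. */
static bool below_claimed_region(unsigned long pfn)
{
	return pfn < 0x100;
}
```

A 0x20-byte access starting 0x10 bytes before a page boundary touches two pfns, so checking only the first page would wrongly let it through when the second page is off limits.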

Daniel Vetter Oct. 7, 2020, 6:01 p.m. UTC | #2

On Wed, Oct 7, 2020 at 7:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Oct 07, 2020 at 06:44:20PM +0200, Daniel Vetter wrote:
> > Way back it was a reasonable assumption that iomem mappings never
> > change the pfn range they point at. But this has changed:
> >
> > - gpu drivers dynamically manage their memory nowadays, invalidating
> >   ptes with unmap_mapping_range when buffers get moved
> >
> > - contiguous dma allocations have moved from dedicated carvetouts to
> >   cma regions. This means if we miss the unmap the pfn might contain
> >   pagecache or anon memory (well anything allocated with GFP_MOVEABLE)
> >
> > - even /dev/mem now invalidates mappings when the kernel requests that
> >   iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87
> >   ("/dev/mem: Revoke mappings when a driver claims the region")
> >
> > Accessing pfns obtained from ptes without holding all the locks is
> > therefore no longer a good idea. Fix this.
> >
> > Since ioremap might need to manipulate pagetables too we need to drop
> > the pt lock and have a retry loop if we raced.
> >
> > While at it, also add kerneldoc and improve the comment for the
> > vma_ops->access function. It's for accessing, not for moving the
> > memory from iomem to system memory, as the old comment seemed to
> > suggest.
> >
> > References: 28b2ee20c7cb ("access_process_vm device memory infrastructure")
> > Cc: Jason Gunthorpe <jgg@ziepe.ca>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Benjamin Herrensmidt <benh@kernel.crashing.org>
> > Cc: Dave Airlie <airlied@linux.ie>
> > Cc: Hugh Dickins <hugh@veritas.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Cc: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: linux-mm@kvack.org
> > Cc: linux-arm-kernel@lists.infradead.org
> > Cc: linux-samsung-soc@vger.kernel.org
> > Cc: linux-media@vger.kernel.org
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > ---
> >  include/linux/mm.h |  3 ++-
> >  mm/memory.c        | 44 ++++++++++++++++++++++++++++++++++++++++++--
> >  2 files changed, 44 insertions(+), 3 deletions(-)
>
> This does seem to solve the race with revoke_devmem(), but it is really ugly.
>
> It would be much nicer to wrap a rwsem around this access and the unmap.
>
> Any place using it has a nice linear translation from vm_off to pfn,
> so I don't think there is such a good reason to use follow_pte in
> the first place.
>
> ie why not the helper be this:
>
>  int generic_access_phys(unsigned long pfn, unsigned long pgprot,
>       void *buf, size_t len, bool write)
>
> Then something like dev/mem would compute pfn and obtain the lock:
>
> static int dev_access(struct vm_area_struct *vma, unsigned long addr,
>                       void *buf, int len, int write)
> {
>      phys_addr_t cpu_addr = ((phys_addr_t)vma->vm_pgoff << PAGE_SHIFT) +
>                             (addr - vma->vm_start);
>      int ret;
>
>      /* FIXME: Has to be over each page of len */
>      if (!devmem_is_allowed_access(PHYS_PFN(cpu_addr)))
>            return -EPERM;
>
>      /* Readers hold mem_sem so a revoke (the writer) cannot run mid-access */
>      down_read(&mem_sem);
>      ret = generic_access_phys(PHYS_PFN(cpu_addr), pgprot_val(vma->vm_page_prot),
>                                buf, len, write);
>      up_read(&mem_sem);
>      return ret;
> }
>
> The other cases looked simpler because they don't revoke, here the
> mmap_sem alone should be enough protection, they would just need to
> provide the linear translation to pfn.
>
> What do you think?

I think it'd fix the bug, until someone wires ->access up for
drivers/gpu, or the next subsystem. This is also just for ptrace, so
we really don't care when we stall the vm badly and other silly
things. So I figured the somewhat ugly, but full generic solution is
the better one, so that people who want to be able to ptrace
read/write their iomem mmaps can just sprinkle this wherever they feel
like.

But yeah if we go with most minimal fix, i.e. only trying to fix the
current users, then your thing should work and is simpler. But it
leaves the door open for future problems.
-Daniel

Jason Gunthorpe Oct. 7, 2020, 11:21 p.m. UTC | #3

On Wed, Oct 07, 2020 at 08:01:42PM +0200, Daniel Vetter wrote:
> I think it'd fix the bug, until someone wires ->access up for
> drivers/gpu, or the next subsystem. This is also just for ptrace, so
> we really don't care when we stall the vm badly and other silly
> things. So I figured the somewhat ugly, but full generic solution is
> the better one, so that people who want to be able to ptrace
> read/write their iomem mmaps can just sprinkle this wherever they feel
> like.
> 
> But yeah if we go with most minimal fix, i.e. only trying to fix the
> current users, then your thing should work and is simpler. But it
> leaves the door open for future problems.

The only other idea I had was to fully make the 'vma of __iomem
memory' some generic utility, completely take over the vm_ops.

We did something like this in RDMA, what I found was even just
implementing mmap() using the kernel helpers turned out to be pretty
tricky, many drivers did it wrong in small ways.

Jason

John Hubbard Oct. 8, 2020, 12:44 a.m. UTC | #4

On 10/7/20 9:44 AM, Daniel Vetter wrote:
> Way back it was a reasonable assumption that iomem mappings never
> change the pfn range they point at. But this has changed:
> 
> - gpu drivers dynamically manage their memory nowadays, invalidating
>    ptes with unmap_mapping_range when buffers get moved
> 
> - contiguous dma allocations have moved from dedicated carvetouts to

s/carvetouts/carveouts/

>    cma regions. This means if we miss the unmap the pfn might contain
>    pagecache or anon memory (well anything allocated with GFP_MOVEABLE)
> 
> - even /dev/mem now invalidates mappings when the kernel requests that
>    iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87
>    ("/dev/mem: Revoke mappings when a driver claims the region")

Thanks for putting these references into the log, it's very helpful.
...
> diff --git a/mm/memory.c b/mm/memory.c
> index fcfc4ca36eba..8d467e23b44e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4873,28 +4873,68 @@ int follow_phys(struct vm_area_struct *vma,
>   	return ret;
>   }
>   
> +/**
> + * generic_access_phys - generic implementation for iomem mmap access
> + * @vma: the vma to access
> + * @addr: userspace address, not relative offset within @vma
> + * @buf: buffer to read/write
> + * @len: length of transfer
> + * @write: set to FOLL_WRITE when writing, otherwise reading
> + *
> + * This is a generic implementation for &vm_operations_struct.access for an
> + * iomem mapping. This callback is used by access_process_vm() when the @vma is
> + * not page based.
> + */
>   int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>   			void *buf, int len, int write)
>   {
>   	resource_size_t phys_addr;
>   	unsigned long prot = 0;
>   	void __iomem *maddr;
> +	pte_t *ptep, pte;
> +	spinlock_t *ptl;
>   	int offset = addr & (PAGE_SIZE-1);
> +	int ret = -EINVAL;
> +
> +	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
> +		return -EINVAL;
> +
> +retry:
> +	if (follow_pte(vma->vm_mm, addr, &ptep, &ptl))
> +		return -EINVAL;
> +	pte = *ptep;
> +	pte_unmap_unlock(ptep, ptl);
>   
> -	if (follow_phys(vma, addr, write, &prot, &phys_addr))
> +	prot = pgprot_val(pte_pgprot(pte));
> +	phys_addr = (resource_size_t)pte_pfn(pte) << PAGE_SHIFT;
> +
> +	if ((write & FOLL_WRITE) && !pte_write(pte))
>   		return -EINVAL;
>   
>   	maddr = ioremap_prot(phys_addr, PAGE_ALIGN(len + offset), prot);
>   	if (!maddr)
>   		return -ENOMEM;
>   
> +	if (follow_pte(vma->vm_mm, addr, &ptep, &ptl))
> +		goto out_unmap;
> +
> +	if (pte_same(pte, *ptep)) {


The ioremap area is something I'm sorta new to, so a newbie question:
is it possible for the same pte to already be there, ever? If so, we
be stuck in an infinite loop here.  I'm sure that's not the case, but
it's not yet obvious to me why it's impossible. Resource reservations
maybe?


thanks,

Daniel Vetter Oct. 8, 2020, 7:23 a.m. UTC | #5

On Thu, Oct 8, 2020 at 2:44 AM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 10/7/20 9:44 AM, Daniel Vetter wrote:
> > Way back it was a reasonable assumption that iomem mappings never
> > change the pfn range they point at. But this has changed:
> >
> > - gpu drivers dynamically manage their memory nowadays, invalidating
> >    ptes with unmap_mapping_range when buffers get moved
> >
> > - contiguous dma allocations have moved from dedicated carvetouts to
>
> s/carvetouts/carveouts/
>
> >    cma regions. This means if we miss the unmap the pfn might contain
> >    pagecache or anon memory (well anything allocated with GFP_MOVEABLE)
> >
> > - even /dev/mem now invalidates mappings when the kernel requests that
> >    iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87
> >    ("/dev/mem: Revoke mappings when a driver claims the region")
>
> Thanks for putting these references into the log, it's very helpful.
> ...
> > diff --git a/mm/memory.c b/mm/memory.c
> > index fcfc4ca36eba..8d467e23b44e 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4873,28 +4873,68 @@ int follow_phys(struct vm_area_struct *vma,
> >       return ret;
> >   }
> >
> > +/**
> > + * generic_access_phys - generic implementation for iomem mmap access
> > + * @vma: the vma to access
> > > + * @addr: userspace address, not relative offset within @vma
> > + * @buf: buffer to read/write
> > + * @len: length of transfer
> > + * @write: set to FOLL_WRITE when writing, otherwise reading
> > + *
> > + * This is a generic implementation for &vm_operations_struct.access for an
> > + * iomem mapping. This callback is used by access_process_vm() when the @vma is
> > + * not page based.
> > + */
> >   int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
> >                       void *buf, int len, int write)
> >   {
> >       resource_size_t phys_addr;
> >       unsigned long prot = 0;
> >       void __iomem *maddr;
> > +     pte_t *ptep, pte;
> > +     spinlock_t *ptl;
> >       int offset = addr & (PAGE_SIZE-1);
> > +     int ret = -EINVAL;
> > +
> > +     if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
> > +             return -EINVAL;
> > +
> > +retry:
> > +     if (follow_pte(vma->vm_mm, addr, &ptep, &ptl))
> > +             return -EINVAL;
> > +     pte = *ptep;
> > +     pte_unmap_unlock(ptep, ptl);
> >
> > -     if (follow_phys(vma, addr, write, &prot, &phys_addr))
> > +     prot = pgprot_val(pte_pgprot(pte));
> > +     phys_addr = (resource_size_t)pte_pfn(pte) << PAGE_SHIFT;
> > +
> > +     if ((write & FOLL_WRITE) && !pte_write(pte))
> >               return -EINVAL;
> >
> >       maddr = ioremap_prot(phys_addr, PAGE_ALIGN(len + offset), prot);
> >       if (!maddr)
> >               return -ENOMEM;
> >
> > +     if (follow_pte(vma->vm_mm, addr, &ptep, &ptl))
> > +             goto out_unmap;
> > +
> > +     if (pte_same(pte, *ptep)) {
>
>
> The ioremap area is something I'm sorta new to, so a newbie question:
> is it possible for the same pte to already be there, ever? If so, we
> would be stuck in an infinite loop here.  I'm sure that's not the case, but
> it's not yet obvious to me why it's impossible. Resource reservations
> maybe?

It's just buggy, it should be !pte_same. And I need to figure out how
to test this I guess.
-Daniel
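To see why the sense of the comparison matters, here is a tiny simulation. It assumes, based on the infinite-loop question and the "should be !pte_same" answer, that the branch guarded by the comparison is the unmap-and-retry path; the rest of the hunk is not shown in the thread. `passes_until_access()` and its parameters are invented for this illustration.

```c
#include <stdbool.h>

/*
 * Count passes through a snapshot/recheck loop until the access goes ahead.
 * retry_when_same selects the predicate: true models the posted patch
 * (retry if the pte is unchanged), false models the fix (retry if it changed).
 * Returns -1 if the loop is still retrying after max_passes.
 */
static int passes_until_access(bool retry_when_same, bool pte_moves_once,
			       int max_passes)
{
	unsigned long live = 0x5000;	/* the pte as seen in the page table */
	int pass;

	for (pass = 1; pass <= max_passes; pass++) {
		unsigned long snap = live;	/* first lookup, copy the pte */

		if (pte_moves_once && pass == 1)
			live += 0x1000;		/* simulated concurrent move */

		bool same = (snap == live);	/* second lookup, compare */
		bool retry = retry_when_same ? same : !same;

		if (!retry)
			return pass;
	}
	return -1;
}
```

With the posted predicate a stable pte never gets past the check, and a pte that did move is accessed through the stale mapping on the very first pass. With the inverted predicate the stable case finishes in one pass and the moved case in two.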
