Message ID | 20201021085655.1192025-13-daniel.vetter@ffwll.ch (mailing list archive) |
---|---|
State | New, archived |
Series | follow_pfn and other iomap races |
On Wed, Oct 21, 2020 at 10:56:51AM +0200, Daniel Vetter wrote:
> There's three ways to access PCI BARs from userspace: /dev/mem, sysfs
> files, and the old proc interface. Two check against
> iomem_is_exclusive, proc never did. And with CONFIG_IO_STRICT_DEVMEM,
> this starts to matter, since we don't want random userspace having
> access to PCI BARs while a driver is loaded and using it.
>
> Fix this by adding the same iomem_is_exclusive() check we already have
> on the sysfs side in pci_mmap_resource().
>
> References: 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: linux-mm@kvack.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-samsung-soc@vger.kernel.org
> Cc: linux-media@vger.kernel.org
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.com>

Maybe not for fixing in this series, but this access to
IORESOURCE_BUSY doesn't have any locking.

The write side holds the resource_lock at least..

>       ret = pci_mmap_page_range(dev, i, vma,
>                                 fpriv->mmap_state, write_combine);

At this point the vma isn't linked into the address space, so doesn't
this happen?

         CPU 0                                  CPU1
     mmap_region()
       vma = vm_area_alloc
       proc_bus_pci_mmap
        iomem_is_exclusive
        pci_mmap_page_range
                                                revoke_devmem
                                                 unmap_mapping_range()
         // vma is not linked to the address space here,
         // unmap doesn't find it
       vma_link()
       !!! The VMA gets mapped with the revoked PTEs

I couldn't find anything that prevents it at least, no mmap_sem on the
unmap side, just the i_mmap_lock

Not seeing how address space and pre-populating during mmap work
together? Did I miss locking someplace?

Not something to be fixed for this series, this is clearly an
improvement, but seems like another problem to tackle?

Jason
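The interleaving Jason describes can be replayed in a small userspace model. This is only an illustrative sketch: every type and function name here (`toy_vma`, `toy_mapping`, `toy_mmap`, `toy_revoke`, `toy_vma_link`, `toy_replay_race`) is hypothetical, and the steps run sequentially in the order from the mail rather than on two real CPUs. It assumes that the revoke walk can only reach VMAs that have already been linked.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model; none of these types are the kernel's. */
struct toy_vma {
    bool linked;        /* reachable by an unmap walk (after "vma_link") */
    bool ptes_present;  /* PTEs were written by the ->mmap callback */
};

struct toy_mapping {
    struct toy_vma *vmas[4];  /* VMAs an unmap walk can find */
    size_t nr;
    bool exclusive;           /* stands in for iomem_is_exclusive() */
};

/* CPU 0, step 1: ->mmap checks exclusivity, then pre-populates PTEs. */
int toy_mmap(const struct toy_mapping *m, struct toy_vma *vma)
{
    if (m->exclusive)
        return -1;
    vma->ptes_present = true;
    return 0;
}

/* CPU 1: a driver claims the range and zaps PTEs of every linked VMA. */
void toy_revoke(struct toy_mapping *m)
{
    m->exclusive = true;
    for (size_t i = 0; i < m->nr; i++)
        m->vmas[i]->ptes_present = false;
}

/* CPU 0, step 2: the VMA becomes visible to unmap walks. */
void toy_vma_link(struct toy_mapping *m, struct toy_vma *vma)
{
    vma->linked = true;
    m->vmas[m->nr++] = vma;
}

/* Replays the order from the mail; true means stale PTEs survived. */
bool toy_replay_race(void)
{
    struct toy_mapping m = { .nr = 0, .exclusive = false };
    struct toy_vma vma = { false, false };

    if (toy_mmap(&m, &vma))   /* CPU 0: check passes, PTEs written */
        return false;
    toy_revoke(&m);           /* CPU 1: walk finds no VMA to zap */
    toy_vma_link(&m, &vma);   /* CPU 0: VMA linked with PTEs intact */
    return m.exclusive && vma.ptes_present;
}
```

In this model the range ends up marked exclusive while the VMA still holds its pre-populated PTEs, which is the outcome flagged with "!!!" above.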
On Wed, Oct 21, 2020 at 2:50 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Oct 21, 2020 at 10:56:51AM +0200, Daniel Vetter wrote:
> > There's three ways to access PCI BARs from userspace: /dev/mem, sysfs
> > files, and the old proc interface. Two check against
> > iomem_is_exclusive, proc never did. And with CONFIG_IO_STRICT_DEVMEM,
> > this starts to matter, since we don't want random userspace having
> > access to PCI BARs while a driver is loaded and using it.
> >
> > Fix this by adding the same iomem_is_exclusive() check we already have
> > on the sysfs side in pci_mmap_resource().
> >
> > References: 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > Cc: Jason Gunthorpe <jgg@ziepe.ca>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Cc: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: linux-mm@kvack.org
> > Cc: linux-arm-kernel@lists.infradead.org
> > Cc: linux-samsung-soc@vger.kernel.org
> > Cc: linux-media@vger.kernel.org
> > Cc: Bjorn Helgaas <bhelgaas@google.com>
> > Cc: linux-pci@vger.kernel.org
> > Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.com>
>
> Maybe not for fixing in this series, but this access to
> IORESOURCE_BUSY doesn't have any locking.
>
> The write side holds the resource_lock at least..
>
> >       ret = pci_mmap_page_range(dev, i, vma,
> >                                 fpriv->mmap_state, write_combine);
>
> At this point the vma isn't linked into the address space, so doesn't
> this happen?
>
>      CPU 0                                  CPU1
>  mmap_region()
>    vma = vm_area_alloc
>    proc_bus_pci_mmap
>     iomem_is_exclusive
>     pci_mmap_page_range
>                                             revoke_devmem
>                                              unmap_mapping_range()
>      // vma is not linked to the address space here,
>      // unmap doesn't find it
>    vma_link()
>    !!! The VMA gets mapped with the revoked PTEs
>
> I couldn't find anything that prevents it at least, no mmap_sem on the
> unmap side, just the i_mmap_lock
>
> Not seeing how address space and pre-populating during mmap work
> together? Did I miss locking someplace?
>
> Not something to be fixed for this series, this is clearly an
> improvement, but seems like another problem to tackle?

Uh yes. In drivers/gpu this isn't a problem because we only install
ptes from the vm_ops->fault handler. So no races. And I don't think
you can fix this otherwise through holding locks: mmap_sem we can't
hold because before vma_link we don't even know which mm_struct is
involved, so can't solve the race. Plus this would be worse than
mm_take_all_locks used by mmu notifier. And the address_space
i_mmap_lock is also no good since it's not held during the ->mmap
callback, when we write the ptes. And the resource lock is even less
useful, since we're not going to hold that at vma_link() time for
sure.

Hence delaying the pte writes after the vma_link, which means ->fault
time, looks like the only way to close this gap.

Trouble is I have no idea how to do this cleanly ...
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
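The idea of deferring PTE writes to fault time can be sketched in the same kind of userspace model. All names below (`lazy_vma`, `lazy_mapping`, `lazy_mmap`, `lazy_fault`, `lazy_revoke`, `lazy_vma_link`, `lazy_replay`) are hypothetical and not kernel API. The model assumes the fault handler only ever runs on a linked VMA and rechecks exclusivity before installing anything. It runs sequentially, so it does not model how a real fault and a real revoke would be serialized against each other.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model; none of these types are the kernel's. */
struct lazy_vma {
    bool linked;
    bool pte_present;
};

struct lazy_mapping {
    struct lazy_vma *vmas[4];
    size_t nr;
    bool exclusive;   /* stands in for iomem_is_exclusive() */
};

/* ->mmap only validates the request; it writes no PTEs. */
int lazy_mmap(const struct lazy_mapping *m)
{
    return m->exclusive ? -1 : 0;
}

void lazy_vma_link(struct lazy_mapping *m, struct lazy_vma *vma)
{
    vma->linked = true;
    m->vmas[m->nr++] = vma;
}

/* Fault path: refuses unlinked VMAs and revoked ranges. */
int lazy_fault(const struct lazy_mapping *m, struct lazy_vma *vma)
{
    if (!vma->linked || m->exclusive)
        return -1;   /* a real handler would presumably signal the task */
    vma->pte_present = true;
    return 0;
}

void lazy_revoke(struct lazy_mapping *m)
{
    m->exclusive = true;
    for (size_t i = 0; i < m->nr; i++)
        m->vmas[i]->pte_present = false;
}

/* Returns whether a PTE is left behind once the range is exclusive. */
bool lazy_replay(bool revoke_before_link)
{
    struct lazy_mapping m = { .nr = 0, .exclusive = false };
    struct lazy_vma vma = { false, false };

    if (lazy_mmap(&m))
        return false;
    if (revoke_before_link)
        lazy_revoke(&m);      /* nothing to zap, nothing installed yet */
    lazy_vma_link(&m, &vma);
    lazy_fault(&m, &vma);     /* fails if the range was already revoked */
    if (!revoke_before_link)
        lazy_revoke(&m);      /* VMA is linked, so the zap finds it */
    return vma.pte_present;
}
```

In both orderings the model ends with no PTE present, because anything installed is installed on a VMA the revoke walk can already see.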
On Wed, Oct 21, 2020 at 04:42:11PM +0200, Daniel Vetter wrote:

> Uh yes. In drivers/gpu this isn't a problem because we only install
> ptes from the vm_ops->fault handler. So no races. And I don't think
> you can fix this otherwise through holding locks: mmap_sem we can't
> hold because before vma_link we don't even know which mm_struct is
> involved, so can't solve the race. Plus this would be worse than
> mm_take_all_locks used by mmu notifier. And the address_space
> i_mmap_lock is also no good since it's not held during the ->mmap
> callback, when we write the ptes. And the resource lock is even less
> useful, since we're not going to hold that at vma_link() time for
> sure.
>
> Hence delaying the pte writes after the vma_link, which means ->fault
> time, looks like the only way to close this gap.

> Trouble is I have no idea how to do this cleanly ...

How about adding a vm_ops callback 'install_pages'/'prefault_pages'?

Call it after vma_link() - basically just move the remap_pfn, under
some other lock, into there.

Jason
On Wed, Oct 21, 2020 at 5:13 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Oct 21, 2020 at 04:42:11PM +0200, Daniel Vetter wrote:
>
> > Uh yes. In drivers/gpu this isn't a problem because we only install
> > ptes from the vm_ops->fault handler. So no races. And I don't think
> > you can fix this otherwise through holding locks: mmap_sem we can't
> > hold because before vma_link we don't even know which mm_struct is
> > involved, so can't solve the race. Plus this would be worse than
> > mm_take_all_locks used by mmu notifier. And the address_space
> > i_mmap_lock is also no good since it's not held during the ->mmap
> > callback, when we write the ptes. And the resource lock is even less
> > useful, since we're not going to hold that at vma_link() time for
> > sure.
> >
> > Hence delaying the pte writes after the vma_link, which means ->fault
> > time, looks like the only way to close this gap.
>
> > Trouble is I have no idea how to do this cleanly ...
>
> How about adding a vm_ops callback 'install_pages'/'prefault_pages'?
>
> Call it after vma_link() - basically just move the remap_pfn, under
> some other lock, into there.

Yeah, I think that would be useful. This might also be useful for
something entirely different: For legacy fbdev emulation on top of drm
kernel modesetting drivers we need to track dirty pages of VM_IO
mmaps. Right now that's a gross hack, and essentially we just pay the
price for entirely separate storage and an additional memcpy when this
is needed to emulate fbdev mmap on top of drm. But if we had an
install_ptes callback or similar we could just wrap the native vm_ops
and add a mkwrite callback on top for that dirty tracking. For that
the hook would need to be after vma_set_page_prot so that we
write-protect the ptes by default, since that's where we compute
vma_wants_writenotify(). That's also after vma_link, so one hook for
two use-cases.

The trouble is that io_remap_pfn adjusts vma->pgoff, so we'd need to
split that. So ideally ->mmap would never set up any ptes.

I guess one option would be if remap_pfn_range would steal the
vma->vm_ops pointer for itself, then it could set up the correct
->install_ptes hook. But there's tons of callers for that, so not sure
that's a bright idea.
-Daniel
On Wed, Oct 21, 2020 at 05:54:54PM +0200, Daniel Vetter wrote:

> The trouble is that io_remap_pfn adjusts vma->pgoff, so we'd need to
> split that. So ideally ->mmap would never set up any ptes.

/dev/mem makes pgoff == pfn so it doesn't get changed by remap.

pgoff doesn't get touched for MAP_SHARED either, so there are other
users that could work like this - e.g. anyone mmapping IO memory is
probably OK.

> I guess one option would be if remap_pfn_range would steal the
> vma->vm_ops pointer for itself, then it could set up the correct
> ->install_ptes hook. But there's tons of callers for that, so not sure
> that's a bright idea.

The caller has to check that the mapping is still live, and I think
hold a lock across the remap? Auto-deferring it doesn't seem feasible.

Jason
On Wed, Oct 21, 2020 at 6:37 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Oct 21, 2020 at 05:54:54PM +0200, Daniel Vetter wrote:
>
> > The trouble is that io_remap_pfn adjusts vma->pgoff, so we'd need to
> > split that. So ideally ->mmap would never set up any ptes.
>
> /dev/mem makes pgoff == pfn so it doesn't get changed by remap.
>
> pgoff doesn't get touched for MAP_SHARED either, so there are other
> users that could work like this - e.g. anyone mmapping IO memory is
> probably OK.

I was more generally thinking of io_remap_pfn users because of the
mkwrite use-case we might have in fbdev emulation in drm.

> > I guess one option would be if remap_pfn_range would steal the
> > vma->vm_ops pointer for itself, then it could set up the correct
> > ->install_ptes hook. But there's tons of callers for that, so not sure
> > that's a bright idea.
>
> The caller has to check that the mapping is still live, and I think
> hold a lock across the remap? Auto-deferring it doesn't seem feasible.

Right, auto-deferring reopens the race, so making this work
automatically is a bit much. I guess just splitting this into a
setup/install part and then doing the install of all the ptes at first
fault should be good enough. We don't really need a new install_pages
for that, just an io_remap_pfn_range that's split in two parts.
-Daniel
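A two-part remap along the lines Daniel suggests might look roughly like the userspace sketch below. Nothing here is an existing kernel interface: `toy_remap`, `toy_remap_setup`, `toy_remap_install` and `TOY_MAX_PAGES` are made-up names, and the `still_live` argument stands in for whatever liveness check the caller would perform under its own lock, as Jason points out it must.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PAGES 8   /* arbitrary limit for the model */

/* Toy userspace model of a remap split into setup and install. */
struct toy_remap {
    uint64_t first_pfn;
    size_t nr_pages;
    bool installed;
    uint64_t pfn[TOY_MAX_PAGES];   /* valid only once installed */
};

/* Part 1, at ->mmap time: record what to map, write no PTEs. */
int toy_remap_setup(struct toy_remap *r, uint64_t first_pfn, size_t nr_pages)
{
    if (nr_pages == 0 || nr_pages > TOY_MAX_PAGES)
        return -1;
    r->first_pfn = first_pfn;
    r->nr_pages = nr_pages;
    r->installed = false;
    return 0;
}

/*
 * Part 2, at first fault: install every PTE in one go.
 * The caller passes the result of its own liveness check.
 */
int toy_remap_install(struct toy_remap *r, bool still_live)
{
    if (!still_live)
        return -1;          /* mapping was revoked, install nothing */
    if (r->installed)
        return 0;           /* later faults find the work already done */
    for (size_t i = 0; i < r->nr_pages; i++)
        r->pfn[i] = r->first_pfn + i;
    r->installed = true;
    return 0;
}
```

The point of the split is that the decision to write PTEs moves to a moment when the VMA is already linked, while the caller keeps responsibility for the liveness check.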
On Wed, Oct 21, 2020 at 09:24:08PM +0200, Daniel Vetter wrote:
> On Wed, Oct 21, 2020 at 6:37 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Wed, Oct 21, 2020 at 05:54:54PM +0200, Daniel Vetter wrote:
> >
> > > The trouble is that io_remap_pfn adjusts vma->pgoff, so we'd need to
> > > split that. So ideally ->mmap would never set up any ptes.
> >
> > /dev/mem makes pgoff == pfn so it doesn't get changed by remap.
> >
> > pgoff doesn't get touched for MAP_SHARED either, so there are other
> > users that could work like this - e.g. anyone mmapping IO memory is
> > probably OK.
>
> I was more generally thinking of io_remap_pfn users because of the
> mkwrite use-case we might have in fbdev emulation in drm.

You have a use case for MAP_PRIVATE and io_remap_pfn_range()??

Jason
On Thu, Oct 22, 2020 at 1:20 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Oct 21, 2020 at 09:24:08PM +0200, Daniel Vetter wrote:
> > On Wed, Oct 21, 2020 at 6:37 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Wed, Oct 21, 2020 at 05:54:54PM +0200, Daniel Vetter wrote:
> > >
> > > > The trouble is that io_remap_pfn adjusts vma->pgoff, so we'd need to
> > > > split that. So ideally ->mmap would never set up any ptes.
> > >
> > > /dev/mem makes pgoff == pfn so it doesn't get changed by remap.
> > >
> > > pgoff doesn't get touched for MAP_SHARED either, so there are other
> > > users that could work like this - e.g. anyone mmapping IO memory is
> > > probably OK.
> >
> > I was more generally thinking of io_remap_pfn users because of the
> > mkwrite use-case we might have in fbdev emulation in drm.
>
> You have a use case for MAP_PRIVATE and io_remap_pfn_range()??

Uh no :-) The use case is ioremaps where we need to keep track of
which pages userspace has touched. Problem is that there's many
displays where you need to explicitly upload the data, and in drm we
have ioctl calls for that. fbdev mmap assumes this just magically
happens. So you need to keep track of write faults, launch a delayed
worker which first re-protects all ptes and then uploads the dirty
pages. And ideally we wouldn't have to implement this everywhere just
for fbdev, but could wrap it around an existing mmap implementation by
just intercepting mkwrite.
-Daniel
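The dirty-tracking scheme Daniel outlines (catch the first write to each page, then have a worker re-protect everything and upload what changed) can be modelled in userspace as follows. This is a sketch under assumptions, not drm or fbdev code: `toy_fb`, `toy_fb_mkwrite`, `toy_fb_write`, `toy_fb_flush` and `FB_PAGES` are invented names, and the "upload" is reduced to a counter.

```c
#include <stdbool.h>
#include <stddef.h>

#define FB_PAGES 8   /* arbitrary framebuffer size for the model */

/* Toy userspace model of mkwrite-based dirty tracking. */
struct toy_fb {
    bool writable[FB_PAGES];  /* false means the PTE is write-protected */
    bool dirty[FB_PAGES];
    size_t uploads;           /* pages "uploaded" to the display so far */
};

/* Write fault on a write-protected page: note it, then allow writes. */
void toy_fb_mkwrite(struct toy_fb *fb, size_t page)
{
    fb->dirty[page] = true;
    fb->writable[page] = true;
}

/* A userspace store; only the first store per page takes the fault. */
void toy_fb_write(struct toy_fb *fb, size_t page)
{
    if (!fb->writable[page])
        toy_fb_mkwrite(fb, page);
}

/* Delayed worker: re-protect all pages first, then upload dirty ones. */
size_t toy_fb_flush(struct toy_fb *fb)
{
    size_t n = 0;

    for (size_t i = 0; i < FB_PAGES; i++)
        fb->writable[i] = false;
    for (size_t i = 0; i < FB_PAGES; i++) {
        if (fb->dirty[i]) {
            fb->dirty[i] = false;
            n++;
        }
    }
    fb->uploads += n;
    return n;
}
```

Re-protecting before uploading matters in the real scheme because a write that lands during the upload then faults again and marks the page dirty for the next pass; the sequential model only shows the bookkeeping.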
On Thu, Oct 22, 2020 at 09:00:44AM +0200, Daniel Vetter wrote:
> On Thu, Oct 22, 2020 at 1:20 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Wed, Oct 21, 2020 at 09:24:08PM +0200, Daniel Vetter wrote:
> > > On Wed, Oct 21, 2020 at 6:37 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > >
> > > > On Wed, Oct 21, 2020 at 05:54:54PM +0200, Daniel Vetter wrote:
> > > >
> > > > > The trouble is that io_remap_pfn adjusts vma->pgoff, so we'd need to
> > > > > split that. So ideally ->mmap would never set up any ptes.
> > > >
> > > > /dev/mem makes pgoff == pfn so it doesn't get changed by remap.
> > > >
> > > > pgoff doesn't get touched for MAP_SHARED either, so there are other
> > > > users that could work like this - e.g. anyone mmapping IO memory is
> > > > probably OK.
> > >
> > > I was more generally thinking of io_remap_pfn users because of the
> > > mkwrite use-case we might have in fbdev emulation in drm.
> >
> > You have a use case for MAP_PRIVATE and io_remap_pfn_range()??
>
> Uh no :-)

So it is fine, the pgoff mangling only happens for MAP_PRIVATE

Jason
On Thu, Oct 22, 2020 at 1:43 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Oct 22, 2020 at 09:00:44AM +0200, Daniel Vetter wrote:
> > On Thu, Oct 22, 2020 at 1:20 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Wed, Oct 21, 2020 at 09:24:08PM +0200, Daniel Vetter wrote:
> > > > On Wed, Oct 21, 2020 at 6:37 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > >
> > > > > On Wed, Oct 21, 2020 at 05:54:54PM +0200, Daniel Vetter wrote:
> > > > >
> > > > > > The trouble is that io_remap_pfn adjusts vma->pgoff, so we'd need to
> > > > > > split that. So ideally ->mmap would never set up any ptes.
> > > > >
> > > > > /dev/mem makes pgoff == pfn so it doesn't get changed by remap.
> > > > >
> > > > > pgoff doesn't get touched for MAP_SHARED either, so there are other
> > > > > users that could work like this - e.g. anyone mmapping IO memory is
> > > > > probably OK.
> > > >
> > > > I was more generally thinking of io_remap_pfn users because of the
> > > > mkwrite use-case we might have in fbdev emulation in drm.
> > >
> > > You have a use case for MAP_PRIVATE and io_remap_pfn_range()??
> >
> > Uh no :-)
>
> So it is fine, the pgoff mangling only happens for MAP_PRIVATE

Ah right, I got confused, thanks for clarifying.
-Daniel
diff --git a/drivers/pci/proc.c b/drivers/pci/proc.c
index d35186b01d98..3a2f90beb4cb 100644
--- a/drivers/pci/proc.c
+++ b/drivers/pci/proc.c
@@ -274,6 +274,11 @@ static int proc_bus_pci_mmap(struct file *file, struct vm_area_struct *vma)
 		else
 			return -EINVAL;
 	}
+
+	if (dev->resource[i].flags & IORESOURCE_MEM &&
+	    iomem_is_exclusive(dev->resource[i].start))
+		return -EINVAL;
+
 	ret = pci_mmap_page_range(dev, i, vma,
 				  fpriv->mmap_state, write_combine);
 	if (ret < 0)
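The shape of the added check can be illustrated outside the kernel. The sketch below is a userspace stand-in, not the patched function: `toy_resource`, `toy_is_exclusive`, `toy_proc_mmap_check` and the `TOY_RES_*` flag values are hypothetical, and the `claimed` field simply substitutes for whatever `iomem_is_exclusive()` would report for the resource's start address. Only `EINVAL` from `<errno.h>` is real.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values; they are not the kernel's IORESOURCE_* bits. */
#define TOY_RES_IO   0x1u
#define TOY_RES_MEM  0x2u

struct toy_resource {
    uint64_t start;
    unsigned int flags;
    bool claimed;   /* stand-in for the iomem_is_exclusive() result */
};

bool toy_is_exclusive(const struct toy_resource *res)
{
    return res->claimed;
}

/*
 * Mirrors the structure of the check in the patch: only memory
 * resources are tested for exclusivity, everything else passes.
 */
int toy_proc_mmap_check(const struct toy_resource *res)
{
    if ((res->flags & TOY_RES_MEM) && toy_is_exclusive(res))
        return -EINVAL;
    return 0;
}
```

As in the patch, the flag test comes first, so a non-memory resource never reaches the exclusivity lookup.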