
drm/shmem-helper: Don't remove the offset in vm_area_struct pgoff

Message ID 20210217165910.3820374-1-nroberts@igalia.com (mailing list archive)
State New, archived
Series drm/shmem-helper: Don't remove the offset in vm_area_struct pgoff

Commit Message

Neil Roberts Feb. 17, 2021, 4:59 p.m. UTC
When mmapping the shmem, it would previously adjust the pgoff in the
vm_area_struct to remove the fake offset that is added to be able to
identify the buffer. This patch removes the adjustment and makes the
fault handler use the vm_fault address to calculate the page offset
instead. Although using this address is apparently discouraged, several
DRM drivers seem to be doing it anyway.

The problem with removing the pgoff is that it prevents
drm_vma_node_unmap from working because that searches the mapping tree
by address. That doesn't work because all of the mappings are at offset
0. drm_vma_node_unmap is being used by the shmem helpers when purging
the buffer.
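
(For reference, a paraphrased sketch of what drm_vma_node_unmap() does -
not the exact kernel code - which shows why mappings installed at pgoff 0
can never be found: the lookup is done by the node's fake offset.)

#include <drm/drm_vma_manager.h>
#include <linux/mm.h>

/* Zap the CPU mappings of a node: the range is addressed by the node's
 * fake offset within the file's address_space, so if userspace's VMA was
 * installed at pgoff 0 instead, nothing is unmapped. */
static void vma_node_unmap_sketch(struct drm_vma_offset_node *node,
				  struct address_space *file_mapping)
{
	unmap_mapping_range(file_mapping,
			    drm_vma_node_offset_addr(node),	/* fake offset, in bytes */
			    drm_vma_node_size(node) << PAGE_SHIFT,
			    1 /* even_cows */);
}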

It looks like panfrost is using drm_gem_shmem_purge so this might fix a
potential bug there.

Signed-off-by: Neil Roberts <nroberts@igalia.com>
---
 drivers/gpu/drm/drm_gem_shmem_helper.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

Comments

kernel test robot Feb. 17, 2021, 7:33 p.m. UTC | #1
Hi Neil,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v5.11 next-20210217]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Neil-Roberts/drm-shmem-helper-Don-t-remove-the-offset-in-vm_area_struct-pgoff/20210218-012028
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git f40ddce88593482919761f74910f42f4b84c004b
config: i386-randconfig-m021-20210215 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0

If you fix the issue, kindly add the following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

New smatch warnings:
drivers/gpu/drm/drm_gem_shmem_helper.c:534 drm_gem_shmem_fault() warn: unsigned 'page_offset' is never less than zero.

Old smatch warnings:
include/drm/drm_vma_manager.h:225 drm_vma_node_unmap() warn: should 'drm_vma_node_size(node) << 12' be a 64 bit type?

vim +/page_offset +534 drivers/gpu/drm/drm_gem_shmem_helper.c

   521	
   522	static vm_fault_t drm_gem_shmem_fault(struct vm_fault *vmf)
   523	{
   524		struct vm_area_struct *vma = vmf->vma;
   525		struct drm_gem_object *obj = vma->vm_private_data;
   526		struct drm_gem_shmem_object *shmem = to_drm_gem_shmem_obj(obj);
   527		loff_t num_pages = obj->size >> PAGE_SHIFT;
   528		struct page *page;
   529		pgoff_t page_offset;
   530	
   531		/* We don't use vmf->pgoff since that has the fake offset */
   532		page_offset = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
   533	
 > 534		if (page_offset < 0 || page_offset >= num_pages ||
   535		    WARN_ON_ONCE(!shmem->pages))
   536			return VM_FAULT_SIGBUS;
   537	
   538		page = shmem->pages[page_offset];
   539	
   540		return vmf_insert_page(vma, vmf->address, page);
   541	}
   542	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Steven Price Feb. 18, 2021, 3:26 p.m. UTC | #2
On 17/02/2021 16:59, Neil Roberts wrote:
> When mmapping the shmem, it would previously adjust the pgoff in the
> vm_area_struct to remove the fake offset that is added to be able to
> identify the buffer. This patch removes the adjustment and makes the
> fault handler use the vm_fault address to calculate the page offset
> instead. Although using this address is apparently discouraged, several
> DRM drivers seem to be doing it anyway.
> 
> The problem with removing the pgoff is that it prevents
> drm_vma_node_unmap from working because that searches the mapping tree
> by address. That doesn't work because all of the mappings are at offset
> 0. drm_vma_node_unmap is being used by the shmem helpers when purging
> the buffer.
> 
> It looks like panfrost is using drm_gem_shmem_purge so this might fix a
> potential bug there.
> 
> Signed-off-by: Neil Roberts <nroberts@igalia.com>

As the test robot points out pgoff_t is unsigned, so the <0 test makes 
no sense. But apart from that it looks good to me.
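
For illustration, dropping the signed comparison would leave the fault
check as something like this (just a sketch of the suggestion, not an
actual v2 of the patch):

	/* We don't use vmf->pgoff since that has the fake offset */
	page_offset = (vmf->address - vma->vm_start) >> PAGE_SHIFT;

	if (page_offset >= num_pages || WARN_ON_ONCE(!shmem->pages))
		return VM_FAULT_SIGBUS;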

I think this is worth a "Fixes:" line too - as you point out 
drm_vma_node_unmap() won't be working correctly - which means we're 
potentially leaving user space with pages pointing to freed memory - not 
good! 17acb9f35ed7 is my best guess at the commit that introduced this.

Steve

> ---
>   drivers/gpu/drm/drm_gem_shmem_helper.c | 12 +++++++-----
>   1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c b/drivers/gpu/drm/drm_gem_shmem_helper.c
> index 9825c378dfa6..4b14157f1962 100644
> --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> @@ -526,11 +526,16 @@ static vm_fault_t drm_gem_shmem_fault(struct vm_fault *vmf)
>   	struct drm_gem_shmem_object *shmem = to_drm_gem_shmem_obj(obj);
>   	loff_t num_pages = obj->size >> PAGE_SHIFT;
>   	struct page *page;
> +	pgoff_t page_offset;
>   
> -	if (vmf->pgoff >= num_pages || WARN_ON_ONCE(!shmem->pages))
> +	/* We don't use vmf->pgoff since that has the fake offset */
> +	page_offset = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> +
> +	if (page_offset < 0 || page_offset >= num_pages ||
> +	    WARN_ON_ONCE(!shmem->pages))
>   		return VM_FAULT_SIGBUS;
>   
> -	page = shmem->pages[vmf->pgoff];
> +	page = shmem->pages[page_offset];
>   
>   	return vmf_insert_page(vma, vmf->address, page);
>   }
> @@ -581,9 +586,6 @@ int drm_gem_shmem_mmap(struct drm_gem_object *obj, struct vm_area_struct *vma)
>   	struct drm_gem_shmem_object *shmem;
>   	int ret;
>   
> -	/* Remove the fake offset */
> -	vma->vm_pgoff -= drm_vma_node_start(&obj->vma_node);
> -
>   	if (obj->import_attach) {
>   		/* Drop the reference drm_gem_mmap_obj() acquired.*/
>   		drm_gem_object_put(obj);
>
Daniel Vetter Feb. 18, 2021, 3:30 p.m. UTC | #3
On Thu, Feb 18, 2021 at 4:26 PM Steven Price <steven.price@arm.com> wrote:
>
> On 17/02/2021 16:59, Neil Roberts wrote:
> > When mmapping the shmem, it would previously adjust the pgoff in the
> > vm_area_struct to remove the fake offset that is added to be able to
> > identify the buffer. This patch removes the adjustment and makes the
> > fault handler use the vm_fault address to calculate the page offset
> > instead. Although using this address is apparently discouraged, several
> > DRM drivers seem to be doing it anyway.
> >
> > The problem with removing the pgoff is that it prevents
> > drm_vma_node_unmap from working because that searches the mapping tree
> > by address. That doesn't work because all of the mappings are at offset
> > 0. drm_vma_node_unmap is being used by the shmem helpers when purging
> > the buffer.
> >
> > It looks like panfrost is using drm_gem_shmem_purge so this might fix a
> > potential bug there.
> >
> > Signed-off-by: Neil Roberts <nroberts@igalia.com>
>
> As the test robot points out pgoff_t is unsigned, so the <0 test makes
> no sense. But apart from that it looks good to me.
>
> I think this is worth a "Fixes:" line too - as you point out
> drm_vma_node_unmap() won't be working correctly - which means we're
> potentially leaving user space with pages pointing to freed memory - not
> good! 17acb9f35ed7 is my best guess at the commit that introduced this.

Yeah plus Cc: stable for backporting and I think an igt or similar for
panfrost to check this works correctly would be pretty good too. Since
if it took us over 1 year to notice this bug it's pretty clear that
normal testing doesn't catch this. So very likely we'll break this
again.

btw for testing shrinkers recommended way is to have a debugfs file
that just force-shrinks everything. That way you avoid all the trouble
that tend to happen when you drive a system close to OOM on linux, and
it's also much faster.
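
Very roughly something like the sketch below - the function names and the
way it gets wired into the driver's debugfs are made up for illustration;
the core idea is just to call the registered shrinker's scan callback
directly with as large a target as possible:

#include <linux/debugfs.h>
#include <linux/limits.h>
#include <linux/sched/mm.h>
#include <linux/shrinker.h>

/* Hypothetical "force shrink" knob: writing any value runs the driver's
 * shrinker scan callback directly, so the reclaim path (and its lockdep
 * annotations) can be exercised without pushing the system to OOM. */
static int force_shrink_set(void *data, u64 val)
{
	struct shrinker *shrinker = data;
	struct shrink_control sc = {
		.gfp_mask = GFP_KERNEL,
		.nr_to_scan = LONG_MAX,
	};

	fs_reclaim_acquire(GFP_KERNEL);	/* keep lockdep's reclaim tracking honest */
	shrinker->scan_objects(shrinker, &sc);
	fs_reclaim_release(GFP_KERNEL);

	return 0;
}
DEFINE_DEBUGFS_ATTRIBUTE(force_shrink_fops, NULL, force_shrink_set, "%llu\n");

/* e.g. from the driver's debugfs init (illustrative):
 * debugfs_create_file("force_shrink", 0200, minor->debugfs_root,
 *		       &pfdev->shrinker, &force_shrink_fops);
 */
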
-Daniel

> Steve
>
> > ---
> >   drivers/gpu/drm/drm_gem_shmem_helper.c | 12 +++++++-----
> >   1 file changed, 7 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c b/drivers/gpu/drm/drm_gem_shmem_helper.c
> > index 9825c378dfa6..4b14157f1962 100644
> > --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> > +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> > @@ -526,11 +526,16 @@ static vm_fault_t drm_gem_shmem_fault(struct vm_fault *vmf)
> >       struct drm_gem_shmem_object *shmem = to_drm_gem_shmem_obj(obj);
> >       loff_t num_pages = obj->size >> PAGE_SHIFT;
> >       struct page *page;
> > +     pgoff_t page_offset;
> >
> > -     if (vmf->pgoff >= num_pages || WARN_ON_ONCE(!shmem->pages))
> > +     /* We don't use vmf->pgoff since that has the fake offset */
> > +     page_offset = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> > +
> > +     if (page_offset < 0 || page_offset >= num_pages ||
> > +         WARN_ON_ONCE(!shmem->pages))
> >               return VM_FAULT_SIGBUS;
> >
> > -     page = shmem->pages[vmf->pgoff];
> > +     page = shmem->pages[page_offset];
> >
> >       return vmf_insert_page(vma, vmf->address, page);
> >   }
> > @@ -581,9 +586,6 @@ int drm_gem_shmem_mmap(struct drm_gem_object *obj, struct vm_area_struct *vma)
> >       struct drm_gem_shmem_object *shmem;
> >       int ret;
> >
> > -     /* Remove the fake offset */
> > -     vma->vm_pgoff -= drm_vma_node_start(&obj->vma_node);
> > -
> >       if (obj->import_attach) {
> >               /* Drop the reference drm_gem_mmap_obj() acquired.*/
> >               drm_gem_object_put(obj);
> >
>
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
Alyssa Rosenzweig Feb. 18, 2021, 3:45 p.m. UTC | #4
> Yeah plus Cc: stable for backporting and I think an igt or similar for
> panfrost to check this works correctly would be pretty good too. Since
> if it took us over 1 year to notice this bug it's pretty clear that
> normal testing doesn't catch this. So very likely we'll break this
> again.

Unfortunately there are a lot of kernel bugs which are noticed during actual
use (but not CI runs), some of which have never been fixed. I do know
the shrinker impl is buggy for us, if this is the fix I'm very happy.

> btw for testing shrinkers recommended way is to have a debugfs file
> that just force-shrinks everything. That way you avoid all the trouble
> that tend to happen when you drive a system close to OOM on linux, and
> it's also much faster.

2nding this as a good idea.
Steven Price Feb. 18, 2021, 4:15 p.m. UTC | #5
On 18/02/2021 15:45, Alyssa Rosenzweig wrote:
>> Yeah plus Cc: stable for backporting and I think an igt or similar for
>> panfrost to check this works correctly would be pretty good too. Since
>> if it took us over 1 year to notice this bug it's pretty clear that
>> normal testing doesn't catch this. So very likely we'll break this
>> again.
> 
> Unfortunately there are a lot of kernel bugs which are noticed during actual
> use (but not CI runs), some of which have never been fixed. I do know
> the shrinker impl is buggy for us, if this is the fix I'm very happy.

I doubt this will actually "fix" anything - if I understand correctly 
then the sequence which is broken is:

  * allocate BO, mmap to CPU
  * madvise(DONTNEED)
  * trigger purge
  * try to access the BO memory

which is an invalid sequence for user space - the attempt to access 
memory should cause a SIGSEGV. However because drm_vma_node_unmap() is 
unable to find the mappings there may still be page table entries 
present which would provide access to memory the kernel has freed. Which 
is of course a big security hole and so this fix is needed.
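
An igt-style reproducer for that sequence might look roughly like the
sketch below. It's illustrative only: error handling is omitted, the
render node path is assumed, and the "trigger purge" step is the hard
part - it needs memory pressure or a force-shrink knob like the debugfs
file discussed earlier in the thread.

#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <drm/panfrost_drm.h>	/* kernel uapi header */

static sigjmp_buf fault_jmp;

static void fault_handler(int sig)
{
	siglongjmp(fault_jmp, 1);
}

int main(void)
{
	int fd = open("/dev/dri/renderD128", O_RDWR);	/* assumed node */
	struct drm_panfrost_create_bo create = { .size = 4096 };
	struct drm_panfrost_mmap_bo map = { 0 };
	struct drm_panfrost_madvise madv = { 0 };
	volatile char *p;

	ioctl(fd, DRM_IOCTL_PANFROST_CREATE_BO, &create);
	map.handle = create.handle;
	ioctl(fd, DRM_IOCTL_PANFROST_MMAP_BO, &map);
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, map.offset);
	p[0] = 0x55;					/* fault the page in */

	madv.handle = create.handle;
	madv.madv = PANFROST_MADV_DONTNEED;
	ioctl(fd, DRM_IOCTL_PANFROST_MADVISE, &madv);

	/* ... trigger the purge here (memory pressure / debugfs) ... */

	signal(SIGBUS, fault_handler);
	signal(SIGSEGV, fault_handler);
	if (sigsetjmp(fault_jmp, 1) == 0) {
		(void)p[0];	/* must fault; reading stale data is the bug */
		fprintf(stderr, "FAIL: purged BO still mapped\n");
		return 1;
	}
	return 0;
}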

In what way do you find the shrinker impl buggy? I'm aware there's some 
dodgy locking (although I haven't worked out how to fix it) - but AFAICT 
it's more deadlock territory rather than lacking in locks. Are there 
correctness issues?

>> btw for testing shrinkers recommended way is to have a debugfs file
>> that just force-shrinks everything. That way you avoid all the trouble
>> that tend to happen when you drive a system close to OOM on linux, and
>> it's also much faster.
> 
> 2nding this as a good idea.
> 

Sounds like a good idea to me too. But equally I'm wondering whether the 
best (short term) solution is to actually disable the shrinker. I'm 
somewhat surprised that nobody has got fed up with the "Purging xxx 
bytes" message spam - which makes me think that most people are not 
hitting memory pressure to trigger the shrinker.

The shrinker on kbase caused a lot of grief - and the only way I managed 
to get that under control was by writing a static analysis tool for the 
locking, and by upsetting people by enforcing the (rather dumb) rules of 
the tool on the code base. I've been meaning to look at whether sparse 
can do a similar check of locks.

Sadly at the moment I'm struggling to find time to look at such things.

Steve
Rob Herring Feb. 18, 2021, 4:38 p.m. UTC | #6
On Thu, Feb 18, 2021 at 10:15 AM Steven Price <steven.price@arm.com> wrote:
>
> On 18/02/2021 15:45, Alyssa Rosenzweig wrote:
> >> Yeah plus Cc: stable for backporting and I think an igt or similar for
> >> panfrost to check this works correctly would be pretty good too. Since
> >> if it took us over 1 year to notice this bug it's pretty clear that
> >> normal testing doesn't catch this. So very likely we'll break this
> >> again.
> >
> > Unfortunately there are a lot of kernel bugs which are noticed during actual
> > use (but not CI runs), some of which have never been fixed. I do know
> > the shrinker impl is buggy for us, if this is the fix I'm very happy.
>
> I doubt this will actually "fix" anything - if I understand correctly
> then the sequence which is broken is:
>
>   * allocate BO, mmap to CPU
>   * madvise(DONTNEED)
>   * trigger purge
>   * try to access the BO memory
>
> which is an invalid sequence for user space - the attempt to access
> memory should cause a SIGSEGV. However because drm_vma_node_unmap() is
> unable to find the mappings there may still be page table entries
> present which would provide access to memory the kernel has freed. Which
> is of course a big security hole and so this fix is needed.
>
> In what way do you find the shrinker impl buggy? I'm aware there's some
> dodgy locking (although I haven't worked out how to fix it) - but AFAICT
> it's more deadlock territory rather than lacking in locks. Are there
> correctness issues?

What's there was largely a result of getting lockdep happy.

> >> btw for testing shrinkers recommended way is to have a debugfs file
> >> that just force-shrinks everything. That way you avoid all the trouble
> >> that tend to happen when you drive a system close to OOM on linux, and
> >> it's also much faster.
> >
> > 2nding this as a good idea.
> >
>
> Sounds like a good idea to me too. But equally I'm wondering whether the
> best (short term) solution is to actually disable the shrinker. I'm
> somewhat surprised that nobody has got fed up with the "Purging xxx
> bytes" message spam - which makes me think that most people are not
> hitting memory pressure to trigger the shrinker.

If the shrinker is dodgy, then it's probably good to have the messages
to know if it ran.

> The shrinker on kbase caused a lot of grief - and the only way I managed
> to get that under control was by writing a static analysis tool for the
> locking, and by upsetting people by enforcing the (rather dumb) rules of
> the tool on the code base. I've been meaning to look at whether sparse
> can do a similar check of locks.

Lockdep doesn't cover it?

Rob
Steven Price Feb. 18, 2021, 4:52 p.m. UTC | #7
On 18/02/2021 16:38, Rob Herring wrote:
> On Thu, Feb 18, 2021 at 10:15 AM Steven Price <steven.price@arm.com> wrote:
>>
>> On 18/02/2021 15:45, Alyssa Rosenzweig wrote:
>>>> Yeah plus Cc: stable for backporting and I think an igt or similar for
>>>> panfrost to check this works correctly would be pretty good too. Since
>>>> if it took us over 1 year to notice this bug it's pretty clear that
>>>> normal testing doesn't catch this. So very likely we'll break this
>>>> again.
>>>
>>> Unfortunately there are a lot of kernel bugs which are noticed during actual
>>> use (but not CI runs), some of which have never been fixed. I do know
>>> the shrinker impl is buggy for us, if this is the fix I'm very happy.
>>
>> I doubt this will actually "fix" anything - if I understand correctly
>> then the sequence which is broken is:
>>
>>    * allocate BO, mmap to CPU
>>    * madvise(DONTNEED)
>>    * trigger purge
>>    * try to access the BO memory
>>
>> which is an invalid sequence for user space - the attempt to access
>> memory should cause a SIGSEGV. However because drm_vma_node_unmap() is
>> unable to find the mappings there may still be page table entries
>> present which would provide access to memory the kernel has freed. Which
>> is of course a big security hole and so this fix is needed.
>>
>> In what way do you find the shrinker impl buggy? I'm aware there's some
>> dodgy locking (although I haven't worked out how to fix it) - but AFAICT
>> it's more deadlock territory rather than lacking in locks. Are there
>> correctness issues?
> 
> What's there was largely a result of getting lockdep happy.
> 
>>>> btw for testing shrinkers recommended way is to have a debugfs file
>>>> that just force-shrinks everything. That way you avoid all the trouble
>>>> that tend to happen when you drive a system close to OOM on linux, and
>>>> it's also much faster.
>>>
>>> 2nding this as a good idea.
>>>
>>
>> Sounds like a good idea to me too. But equally I'm wondering whether the
>> best (short term) solution is to actually disable the shrinker. I'm
>> somewhat surprised that nobody has got fed up with the "Purging xxx
>> bytes" message spam - which makes me think that most people are not
>> hitting memory pressure to trigger the shrinker.
> 
> If the shrinker is dodgy, then it's probably good to have the messages
> to know if it ran.
> 
>> The shrinker on kbase caused a lot of grief - and the only way I managed
>> to get that under control was by writing a static analysis tool for the
>> locking, and by upsetting people by enforcing the (rather dumb) rules of
>> the tool on the code base. I've been meaning to look at whether sparse
>> can do a similar check of locks.
> 
> Lockdep doesn't cover it?

Short answer: no ;)

The problem with lockdep is that you have to trigger the locking
scenario to get a warning out of it. For example you obviously won't get
any warnings about the shrinker without triggering the shrinker (which
means memory pressure since we don't have the debugfs file to trigger it).

I have to admit I'm not 100% sure I've seen any lockdep warnings based
on buffer objects recently. I can trigger them based on jobs:

-----8<------
[  265.764474] ======================================================
[  265.771380] WARNING: possible circular locking dependency detected
[  265.778294] 5.11.0-rc2+ #22 Tainted: G        W
[  265.784148] ------------------------------------------------------
[  265.791050] kworker/0:3/90 is trying to acquire lock:
[  265.796694] c0d982b0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x38
[  265.804994]
[  265.804994] but task is already holding lock:
[  265.811513] c49a348c (&js->queue[j].lock){+.+.}-{3:3}, at: panfrost_reset+0x124/0x1cc [panfrost]
[  265.821375]
[  265.821375] which lock already depends on the new lock.
[  265.821375]
[  265.830524]
[  265.830524] the existing dependency chain (in reverse order) is:
[  265.838892]
[  265.838892] -> #2 (&js->queue[j].lock){+.+.}-{3:3}:
[  265.845996]        mutex_lock_nested+0x18/0x20
[  265.850961]        panfrost_scheduler_stop+0x1c/0x94 [panfrost]
[  265.857590]        panfrost_reset+0x54/0x1cc [panfrost]
[  265.863441]        process_one_work+0x238/0x51c
[  265.868503]        worker_thread+0x22c/0x2e0
[  265.873270]        kthread+0x128/0x138
[  265.877455]        ret_from_fork+0x14/0x38
[  265.882028]        0x0
[  265.884657]
[  265.884657] -> #1 (dma_fence_map){++++}-{0:0}:
[  265.891277]        dma_resv_lockdep+0x1b4/0x290
[  265.896339]        do_one_initcall+0x5c/0x2e8
[  265.901206]        kernel_init_freeable+0x184/0x1d4
[  265.906651]        kernel_init+0x8/0x11c
[  265.911029]        ret_from_fork+0x14/0x38
[  265.915610]        0x0
[  265.918247]
[  265.918247] -> #0 (fs_reclaim){+.+.}-{0:0}:
[  265.924579]        lock_acquire+0x3a4/0x45c
[  265.929260]        __fs_reclaim_acquire+0x28/0x38
[  265.934523]        slab_pre_alloc_hook.constprop.28+0x1c/0x64
[  265.940948]        kmem_cache_alloc_trace+0x38/0x114
[  265.946493]        panfrost_job_run+0x60/0x2b4 [panfrost]
[  265.952540]        drm_sched_resubmit_jobs+0x88/0xc4 [gpu_sched]
[  265.959256]        panfrost_reset+0x174/0x1cc [panfrost]
[  265.965196]        process_one_work+0x238/0x51c
[  265.970259]        worker_thread+0x22c/0x2e0
[  265.975025]        kthread+0x128/0x138
[  265.979210]        ret_from_fork+0x14/0x38
[  265.983784]        0x0
[  265.986412]
[  265.986412] other info that might help us debug this:
[  265.986412]
[  265.995353] Chain exists of:
[  265.995353]   fs_reclaim --> dma_fence_map --> &js->queue[j].lock
[  265.995353]
[  266.007028]  Possible unsafe locking scenario:
[  266.007028]
[  266.013638]        CPU0                    CPU1
[  266.018694]        ----                    ----
[  266.023750]   lock(&js->queue[j].lock);
[  266.028033]                                lock(dma_fence_map);
[  266.034648]                                lock(&js->queue[j].lock);
[  266.041746]   lock(fs_reclaim);
[  266.045252]
[  266.045252]  *** DEADLOCK ***
[  266.045252]
[  266.051861] 4 locks held by kworker/0:3/90:
[  266.056530]  #0: c18096a8 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x18c/0x51c
[  266.066258]  #1: c46d5f28 ((work_completion)(&pfdev->reset.work)){+.+.}-{0:0}, at: process_one_work+0x18c/0x51c
[  266.077546]  #2: c0dfc5a0 (dma_fence_map){++++}-{0:0}, at: panfrost_reset+0x10/0x1cc [panfrost]
[  266.087294]  #3: c49a348c (&js->queue[j].lock){+.+.}-{3:3}, at: panfrost_reset+0x124/0x1cc [panfrost]
[  266.097624]
[  266.097624] stack backtrace:
[  266.102487] CPU: 0 PID: 90 Comm: kworker/0:3 Tainted: G        W         5.11.0-rc2+ #22
[  266.111529] Hardware name: Rockchip (Device Tree)
[  266.116773] Workqueue: events panfrost_reset [panfrost]
[  266.122628] [<c010f0c0>] (unwind_backtrace) from [<c010ad38>] (show_stack+0x10/0x14)
[  266.131288] [<c010ad38>] (show_stack) from [<c07c3c28>] (dump_stack+0xa4/0xd0)
[  266.139363] [<c07c3c28>] (dump_stack) from [<c0168760>] (check_noncircular+0x6c/0x90)
[  266.148116] [<c0168760>] (check_noncircular) from [<c016a2c0>] (__lock_acquire+0xe90/0x16a0)
[  266.157549] [<c016a2c0>] (__lock_acquire) from [<c016b5f8>] (lock_acquire+0x3a4/0x45c)
[  266.166399] [<c016b5f8>] (lock_acquire) from [<c0247b98>] (__fs_reclaim_acquire+0x28/0x38)
[  266.175637] [<c0247b98>] (__fs_reclaim_acquire) from [<c025445c>] (slab_pre_alloc_hook.constprop.28+0x1c/0x64)
[  266.186820] [<c025445c>] (slab_pre_alloc_hook.constprop.28) from [<c0255714>] (kmem_cache_alloc_trace+0x38/0x114)
[  266.198292] [<c0255714>] (kmem_cache_alloc_trace) from [<bf00d28c>] (panfrost_job_run+0x60/0x2b4 [panfrost])
[  266.209295] [<bf00d28c>] (panfrost_job_run [panfrost]) from [<bf00102c>] (drm_sched_resubmit_jobs+0x88/0xc4 [gpu_sched])
[  266.221476] [<bf00102c>] (drm_sched_resubmit_jobs [gpu_sched]) from [<bf00d0a0>] (panfrost_reset+0x174/0x1cc [panfrost])
[  266.233649] [<bf00d0a0>] (panfrost_reset [panfrost]) from [<c0139fd4>] (process_one_work+0x238/0x51c)
[  266.243974] [<c0139fd4>] (process_one_work) from [<c013aaa0>] (worker_thread+0x22c/0x2e0)
[  266.253115] [<c013aaa0>] (worker_thread) from [<c013fd64>] (kthread+0x128/0x138)
[  266.261382] [<c013fd64>] (kthread) from [<c010011c>] (ret_from_fork+0x14/0x38)
[  266.269456] Exception stack(0xc46d5fb0 to 0xc46d5ff8)
[  266.275098] 5fa0:                                     00000000 00000000 00000000 00000000
[  266.284236] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  266.293373] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000
-----8<------

And indeed a sleeping in atomic bug:
-----8<-----
[  289.672380] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:935
[  289.681210] rockchip-vop ff930000.vop: [drm:vop_crtc_atomic_flush] *ERROR* VOP vblank IRQ stuck for 10 ms
[  289.681880] in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 31, name: kworker/0:1
[  289.701901] INFO: lockdep is turned off.
[  289.706339] irq event stamp: 9414
[  289.710095] hardirqs last  enabled at (9413): [<c07cd1cc>] _raw_spin_unlock_irq+0x20/0x40
[  289.719381] hardirqs last disabled at (9414): [<c07c9140>] __schedule+0x170/0x698
[  289.727855] softirqs last  enabled at (9112): [<c077379c>] reg_todo+0x64/0x51c
[  289.736044] softirqs last disabled at (9110): [<c077377c>] reg_todo+0x44/0x51c
[  289.744233] CPU: 0 PID: 31 Comm: kworker/0:1 Tainted: G        W         5.11.0-rc2+ #22
[  289.753375] Hardware name: Rockchip (Device Tree)
[  289.758711] Workqueue: events panfrost_reset [panfrost]
[  289.764886] [<c010f0c0>] (unwind_backtrace) from [<c010ad38>] (show_stack+0x10/0x14)
[  289.773721] [<c010ad38>] (show_stack) from [<c07c3c28>] (dump_stack+0xa4/0xd0)
[  289.781948] [<c07c3c28>] (dump_stack) from [<c01490f8>] (___might_sleep+0x1bc/0x244)
[  289.790761] [<c01490f8>] (___might_sleep) from [<c07ca720>] (__mutex_lock+0x34/0x3c4)
[  289.799654] [<c07ca720>] (__mutex_lock) from [<c07caac8>] (mutex_lock_nested+0x18/0x20)
[  289.808732] [<c07caac8>] (mutex_lock_nested) from [<bf00b704>] (panfrost_gem_free_object+0x20/0x100 [panfrost])
[  289.820358] [<bf00b704>] (panfrost_gem_free_object [panfrost]) from [<bf00ccb0>] (kref_put+0x38/0x5c [panfrost])
[  289.832247] [<bf00ccb0>] (kref_put [panfrost]) from [<bf00ce0c>] (panfrost_job_cleanup+0x120/0x140 [panfrost])
[  289.843936] [<bf00ce0c>] (panfrost_job_cleanup [panfrost]) from [<bf00ccb0>] (kref_put+0x38/0x5c [panfrost])
[  289.855435] [<bf00ccb0>] (kref_put [panfrost]) from [<c054acb8>] (dma_fence_signal_timestamp_locked+0x1c0/0x1f8)
[  289.867163] [<c054acb8>] (dma_fence_signal_timestamp_locked) from [<c054b1f8>] (dma_fence_signal+0x38/0x58)
[  289.878207] [<c054b1f8>] (dma_fence_signal) from [<bf001388>] (drm_sched_job_done+0x58/0x148 [gpu_sched])
[  289.889237] [<bf001388>] (drm_sched_job_done [gpu_sched]) from [<bf001524>] (drm_sched_start+0xa4/0xd0 [gpu_sched])
[  289.901389] [<bf001524>] (drm_sched_start [gpu_sched]) from [<bf00d0ac>] (panfrost_reset+0x180/0x1cc [panfrost])
[  289.913286] [<bf00d0ac>] (panfrost_reset [panfrost]) from [<c0139fd4>] (process_one_work+0x238/0x51c)
[  289.923970] [<c0139fd4>] (process_one_work) from [<c013aaa0>] (worker_thread+0x22c/0x2e0)
[  289.933257] [<c013aaa0>] (worker_thread) from [<c013fd64>] (kthread+0x128/0x138)
[  289.941661] [<c013fd64>] (kthread) from [<c010011c>] (ret_from_fork+0x14/0x38)
[  289.949867] Exception stack(0xc2243fb0 to 0xc2243ff8)
[  289.955604] 3fa0:                                     00000000 00000000 00000000 00000000
[  289.964840] 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  289.974066] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
-----8<----

Certainly here the mutex causing the problem is the shrinker_lock!

The above is triggered by chucking a whole ton of jobs which
fault at the GPU.

Sadly I haven't found time to work out how to untangle the locks.

Steve
Rob Herring Feb. 18, 2021, 5:15 p.m. UTC | #8
On Thu, Feb 18, 2021 at 10:51 AM Steven Price <steven.price@arm.com> wrote:
>
> On 18/02/2021 16:38, Rob Herring wrote:
> > On Thu, Feb 18, 2021 at 10:15 AM Steven Price <steven.price@arm.com> wrote:
> >>
> >> On 18/02/2021 15:45, Alyssa Rosenzweig wrote:
> >>>> Yeah plus Cc: stable for backporting and I think an igt or similar for
> >>>> panfrost to check this works correctly would be pretty good too. Since
> >>>> if it took us over 1 year to notice this bug it's pretty clear that
> >>>> normal testing doesn't catch this. So very likely we'll break this
> >>>> again.
> >>>
> >>> Unfortunately there are a lot of kernel bugs which are noticed during actual
> >>> use (but not CI runs), some of which have never been fixed. I do know
> >>> the shrinker impl is buggy for us, if this is the fix I'm very happy.
> >>
> >> I doubt this will actually "fix" anything - if I understand correctly
> >> then the sequence which is broken is:
> >>
> >>    * allocate BO, mmap to CPU
> >>    * madvise(DONTNEED)
> >>    * trigger purge
> >>    * try to access the BO memory
> >>
> >> which is an invalid sequence for user space - the attempt to access
> >> memory should cause a SIGSEGV. However because drm_vma_node_unmap() is
> >> unable to find the mappings there may still be page table entries
> >> present which would provide access to memory the kernel has freed. Which
> >> is of course a big security hole and so this fix is needed.
> >>
> >> In what way do you find the shrinker impl buggy? I'm aware there's some
> >> dodgy locking (although I haven't worked out how to fix it) - but AFAICT
> >> it's more deadlock territory rather than lacking in locks. Are there
> >> correctness issues?
> >
> > What's there was largely a result of getting lockdep happy.
> >
> >>>> btw for testing shrinkers recommended way is to have a debugfs file
> >>>> that just force-shrinks everything. That way you avoid all the trouble
> >>>> that tend to happen when you drive a system close to OOM on linux, and
> >>>> it's also much faster.
> >>>
> >>> 2nding this as a good idea.
> >>>
> >>
> >> Sounds like a good idea to me too. But equally I'm wondering whether the
> >> best (short term) solution is to actually disable the shrinker. I'm
> >> somewhat surprised that nobody has got fed up with the "Purging xxx
> >> bytes" message spam - which makes me think that most people are not
> >> hitting memory pressure to trigger the shrinker.
> >
> > If the shrinker is dodgy, then it's probably good to have the messages
> > to know if it ran.
> >
> >> The shrinker on kbase caused a lot of grief - and the only way I managed
> >> to get that under control was by writing a static analysis tool for the
> >> locking, and by upsetting people by enforcing the (rather dumb) rules of
> >> the tool on the code base. I've been meaning to look at whether sparse
> >> can do a similar check of locks.
> >
> > Lockdep doesn't cover it?
>
> Short answer: no ;)
>
> The problem with lockdep is that you have to trigger the locking
> scenario to get a warning out of it. For example you obviously won't get
> any warnings about the shrinker without triggering the shrinker (which
> means memory pressure since we don't have the debugfs file to trigger it).

Actually, you don't need debugfs. Writing to /proc/sys/vm/drop_caches
will do it. Though maybe there are other code path scenarios that
wouldn't be covered.


> I have to admit I'm not 100% sure I've seen any lockdep warnings based
> on buffer objects recently. I can trigger them based on jobs:
>
> -----8<------
> [  265.764474] ======================================================
> [  265.771380] WARNING: possible circular locking dependency detected
> [  265.778294] 5.11.0-rc2+ #22 Tainted: G        W
> [  265.784148] ------------------------------------------------------
> [  265.791050] kworker/0:3/90 is trying to acquire lock:
> [  265.796694] c0d982b0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x38
> [  265.804994]
> [  265.804994] but task is already holding lock:
> [  265.811513] c49a348c (&js->queue[j].lock){+.+.}-{3:3}, at: panfrost_reset+0x124/0x1cc [panfrost]
> [  265.821375]
> [  265.821375] which lock already depends on the new lock.
> [  265.821375]
> [  265.830524]
> [  265.830524] the existing dependency chain (in reverse order) is:
> [  265.838892]
> [  265.838892] -> #2 (&js->queue[j].lock){+.+.}-{3:3}:
> [  265.845996]        mutex_lock_nested+0x18/0x20
> [  265.850961]        panfrost_scheduler_stop+0x1c/0x94 [panfrost]
> [  265.857590]        panfrost_reset+0x54/0x1cc [panfrost]
> [  265.863441]        process_one_work+0x238/0x51c
> [  265.868503]        worker_thread+0x22c/0x2e0
> [  265.873270]        kthread+0x128/0x138
> [  265.877455]        ret_from_fork+0x14/0x38
> [  265.882028]        0x0
> [  265.884657]
> [  265.884657] -> #1 (dma_fence_map){++++}-{0:0}:
> [  265.891277]        dma_resv_lockdep+0x1b4/0x290
> [  265.896339]        do_one_initcall+0x5c/0x2e8
> [  265.901206]        kernel_init_freeable+0x184/0x1d4
> [  265.906651]        kernel_init+0x8/0x11c
> [  265.911029]        ret_from_fork+0x14/0x38
> [  265.915610]        0x0
> [  265.918247]
> [  265.918247] -> #0 (fs_reclaim){+.+.}-{0:0}:
> [  265.924579]        lock_acquire+0x3a4/0x45c
> [  265.929260]        __fs_reclaim_acquire+0x28/0x38
> [  265.934523]        slab_pre_alloc_hook.constprop.28+0x1c/0x64
> [  265.940948]        kmem_cache_alloc_trace+0x38/0x114
> [  265.946493]        panfrost_job_run+0x60/0x2b4 [panfrost]
> [  265.952540]        drm_sched_resubmit_jobs+0x88/0xc4 [gpu_sched]
> [  265.959256]        panfrost_reset+0x174/0x1cc [panfrost]
> [  265.965196]        process_one_work+0x238/0x51c
> [  265.970259]        worker_thread+0x22c/0x2e0
> [  265.975025]        kthread+0x128/0x138
> [  265.979210]        ret_from_fork+0x14/0x38
> [  265.983784]        0x0
> [  265.986412]
> [  265.986412] other info that might help us debug this:
> [  265.986412]
> [  265.995353] Chain exists of:
> [  265.995353]   fs_reclaim --> dma_fence_map --> &js->queue[j].lock
> [  265.995353]
> [  266.007028]  Possible unsafe locking scenario:
> [  266.007028]
> [  266.013638]        CPU0                    CPU1
> [  266.018694]        ----                    ----
> [  266.023750]   lock(&js->queue[j].lock);
> [  266.028033]                                lock(dma_fence_map);
> [  266.034648]                                lock(&js->queue[j].lock);
> [  266.041746]   lock(fs_reclaim);
> [  266.045252]
> [  266.045252]  *** DEADLOCK ***
> [  266.045252]
> [  266.051861] 4 locks held by kworker/0:3/90:
> [  266.056530]  #0: c18096a8 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x18c/0x51c
> [  266.066258]  #1: c46d5f28 ((work_completion)(&pfdev->reset.work)){+.+.}-{0:0}, at: process_one_work+0x18c/0x51c
> [  266.077546]  #2: c0dfc5a0 (dma_fence_map){++++}-{0:0}, at: panfrost_reset+0x10/0x1cc [panfrost]
> [  266.087294]  #3: c49a348c (&js->queue[j].lock){+.+.}-{3:3}, at: panfrost_reset+0x124/0x1cc [panfrost]
> [  266.097624]
> [  266.097624] stack backtrace:
> [  266.102487] CPU: 0 PID: 90 Comm: kworker/0:3 Tainted: G        W         5.11.0-rc2+ #22
> [  266.111529] Hardware name: Rockchip (Device Tree)
> [  266.116773] Workqueue: events panfrost_reset [panfrost]
> [  266.122628] [<c010f0c0>] (unwind_backtrace) from [<c010ad38>] (show_stack+0x10/0x14)
> [  266.131288] [<c010ad38>] (show_stack) from [<c07c3c28>] (dump_stack+0xa4/0xd0)
> [  266.139363] [<c07c3c28>] (dump_stack) from [<c0168760>] (check_noncircular+0x6c/0x90)
> [  266.148116] [<c0168760>] (check_noncircular) from [<c016a2c0>] (__lock_acquire+0xe90/0x16a0)
> [  266.157549] [<c016a2c0>] (__lock_acquire) from [<c016b5f8>] (lock_acquire+0x3a4/0x45c)
> [  266.166399] [<c016b5f8>] (lock_acquire) from [<c0247b98>] (__fs_reclaim_acquire+0x28/0x38)
> [  266.175637] [<c0247b98>] (__fs_reclaim_acquire) from [<c025445c>] (slab_pre_alloc_hook.constprop.28+0x1c/0x64)
> [  266.186820] [<c025445c>] (slab_pre_alloc_hook.constprop.28) from [<c0255714>] (kmem_cache_alloc_trace+0x38/0x114)
> [  266.198292] [<c0255714>] (kmem_cache_alloc_trace) from [<bf00d28c>] (panfrost_job_run+0x60/0x2b4 [panfrost])
> [  266.209295] [<bf00d28c>] (panfrost_job_run [panfrost]) from [<bf00102c>] (drm_sched_resubmit_jobs+0x88/0xc4 [gpu_sched])
> [  266.221476] [<bf00102c>] (drm_sched_resubmit_jobs [gpu_sched]) from [<bf00d0a0>] (panfrost_reset+0x174/0x1cc [panfrost])
> [  266.233649] [<bf00d0a0>] (panfrost_reset [panfrost]) from [<c0139fd4>] (process_one_work+0x238/0x51c)
> [  266.243974] [<c0139fd4>] (process_one_work) from [<c013aaa0>] (worker_thread+0x22c/0x2e0)
> [  266.253115] [<c013aaa0>] (worker_thread) from [<c013fd64>] (kthread+0x128/0x138)
> [  266.261382] [<c013fd64>] (kthread) from [<c010011c>] (ret_from_fork+0x14/0x38)
> [  266.269456] Exception stack(0xc46d5fb0 to 0xc46d5ff8)
> [  266.275098] 5fa0:                                     00000000 00000000 00000000 00000000
> [  266.284236] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> [  266.293373] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> -----8<------
>
> And indeed a sleeping in atomic bug:
> -----8<-----
> [  289.672380] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:935
> [  289.681210] rockchip-vop ff930000.vop: [drm:vop_crtc_atomic_flush] *ERROR* VOP vblank IRQ stuck for 10 ms
> [  289.681880] in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 31, name: kworker/0:1
> [  289.701901] INFO: lockdep is turned off.
> [  289.706339] irq event stamp: 9414
> [  289.710095] hardirqs last  enabled at (9413): [<c07cd1cc>] _raw_spin_unlock_irq+0x20/0x40
> [  289.719381] hardirqs last disabled at (9414): [<c07c9140>] __schedule+0x170/0x698
> [  289.727855] softirqs last  enabled at (9112): [<c077379c>] reg_todo+0x64/0x51c
> [  289.736044] softirqs last disabled at (9110): [<c077377c>] reg_todo+0x44/0x51c
> [  289.744233] CPU: 0 PID: 31 Comm: kworker/0:1 Tainted: G        W         5.11.0-rc2+ #22
> [  289.753375] Hardware name: Rockchip (Device Tree)
> [  289.758711] Workqueue: events panfrost_reset [panfrost]
> [  289.764886] [<c010f0c0>] (unwind_backtrace) from [<c010ad38>] (show_stack+0x10/0x14)
> [  289.773721] [<c010ad38>] (show_stack) from [<c07c3c28>] (dump_stack+0xa4/0xd0)
> [  289.781948] [<c07c3c28>] (dump_stack) from [<c01490f8>] (___might_sleep+0x1bc/0x244)
> [  289.790761] [<c01490f8>] (___might_sleep) from [<c07ca720>] (__mutex_lock+0x34/0x3c4)
> [  289.799654] [<c07ca720>] (__mutex_lock) from [<c07caac8>] (mutex_lock_nested+0x18/0x20)
> [  289.808732] [<c07caac8>] (mutex_lock_nested) from [<bf00b704>] (panfrost_gem_free_object+0x20/0x100 [panfrost])
> [  289.820358] [<bf00b704>] (panfrost_gem_free_object [panfrost]) from [<bf00ccb0>] (kref_put+0x38/0x5c [panfrost])
> [  289.832247] [<bf00ccb0>] (kref_put [panfrost]) from [<bf00ce0c>] (panfrost_job_cleanup+0x120/0x140 [panfrost])
> [  289.843936] [<bf00ce0c>] (panfrost_job_cleanup [panfrost]) from [<bf00ccb0>] (kref_put+0x38/0x5c [panfrost])
> [  289.855435] [<bf00ccb0>] (kref_put [panfrost]) from [<c054acb8>] (dma_fence_signal_timestamp_locked+0x1c0/0x1f8)
> [  289.867163] [<c054acb8>] (dma_fence_signal_timestamp_locked) from [<c054b1f8>] (dma_fence_signal+0x38/0x58)
> [  289.878207] [<c054b1f8>] (dma_fence_signal) from [<bf001388>] (drm_sched_job_done+0x58/0x148 [gpu_sched])
> [  289.889237] [<bf001388>] (drm_sched_job_done [gpu_sched]) from [<bf001524>] (drm_sched_start+0xa4/0xd0 [gpu_sched])
> [  289.901389] [<bf001524>] (drm_sched_start [gpu_sched]) from [<bf00d0ac>] (panfrost_reset+0x180/0x1cc [panfrost])
> [  289.913286] [<bf00d0ac>] (panfrost_reset [panfrost]) from [<c0139fd4>] (process_one_work+0x238/0x51c)
> [  289.923970] [<c0139fd4>] (process_one_work) from [<c013aaa0>] (worker_thread+0x22c/0x2e0)
> [  289.933257] [<c013aaa0>] (worker_thread) from [<c013fd64>] (kthread+0x128/0x138)
> [  289.941661] [<c013fd64>] (kthread) from [<c010011c>] (ret_from_fork+0x14/0x38)
> [  289.949867] Exception stack(0xc2243fb0 to 0xc2243ff8)
> [  289.955604] 3fa0:                                     00000000 00000000 00000000 00000000
> [  289.964840] 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> [  289.974066] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> -----8<----
>
> Certainly here the mutex causing the problem is the shrinker_lock!
>
> The above is triggered by chucking a whole ton of jobs which
> fault at the GPU.
>
> Sadly I haven't found time to work out how to untangle the locks.

They are tricky because pretty much any memory allocation can trigger
things as I recall.

Rob
Daniel Vetter Feb. 18, 2021, 6:20 p.m. UTC | #9
On Thu, Feb 18, 2021 at 6:16 PM Rob Herring <robh+dt@kernel.org> wrote:
>
> On Thu, Feb 18, 2021 at 10:51 AM Steven Price <steven.price@arm.com> wrote:
> >
> > On 18/02/2021 16:38, Rob Herring wrote:
> > > On Thu, Feb 18, 2021 at 10:15 AM Steven Price <steven.price@arm.com> wrote:
> > >>
> > >> On 18/02/2021 15:45, Alyssa Rosenzweig wrote:
> > >>>> Yeah plus Cc: stable for backporting and I think an igt or similar for
> > >>>> panfrost to check this works correctly would be pretty good too. Since
> > >>>> if it took us over 1 year to notice this bug it's pretty clear that
> > >>>> normal testing doesn't catch this. So very likely we'll break this
> > >>>> again.
> > >>>
> > >>> Unfortunately there are a lot of kernel bugs which are noticed during actual
> > >>> use (but not CI runs), some of which have never been fixed. I do know
> > >>> the shrinker impl is buggy for us, if this is the fix I'm very happy.
> > >>
> > >> I doubt this will actually "fix" anything - if I understand correctly
> > >> then the sequence which is broken is:
> > >>
> > >>    * allocate BO, mmap to CPU
> > >>    * madvise(DONTNEED)
> > >>    * trigger purge
> > >>    * try to access the BO memory
> > >>
> > >> which is an invalid sequence for user space - the attempt to access
> > >> memory should cause a SIGSEGV. However because drm_vma_node_unmap() is
> > >> unable to find the mappings there may still be page table entries
> > >> present which would provide access to memory the kernel has freed. Which
> > >> is of course a big security hole and so this fix is needed.
> > >>
> > >> In what way do you find the shrinker impl buggy? I'm aware there's some
> > >> dodgy locking (although I haven't worked out how to fix it) - but AFAICT
> > >> it's more deadlock territory rather than lacking in locks. Are there
> > >> correctness issues?
> > >
> > > What's there was largely a result of getting lockdep happy.
> > >
> > >>>> btw for testing shrinkers recommended way is to have a debugfs file
> > >>>> that just force-shrinks everything. That way you avoid all the trouble
> > >>>> that tend to happen when you drive a system close to OOM on linux, and
> > >>>> it's also much faster.
> > >>>
> > >>> 2nding this as a good idea.
> > >>>
> > >>
> > >> Sounds like a good idea to me too. But equally I'm wondering whether the
> > >> best (short term) solution is to actually disable the shrinker. I'm
> > >> somewhat surprised that nobody has got fed up with the "Purging xxx
> > >> bytes" message spam - which makes me think that most people are not
> > >> hitting memory pressure to trigger the shrinker.
> > >
> > > If the shrinker is dodgy, then it's probably good to have the messages
> > > to know if it ran.
> > >
> > >> The shrinker on kbase caused a lot of grief - and the only way I managed
> > >> to get that under control was by writing a static analysis tool for the
> > >> locking, and by upsetting people by enforcing the (rather dumb) rules of
> > >> the tool on the code base. I've been meaning to look at whether sparse
> > >> can do a similar check of locks.
> > >
> > > Lockdep doesn't cover it?
> >
> > Short answer: no ;)

It's pretty good actually, if you correctly annotate things up.
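
For example, priming the ordering once at init (in the spirit of
dma_resv_lockdep()) records the shrinker-lock vs. reclaim dependency even
if the shrinker never runs for real. A sketch - shrinker_lock is the
panfrost device mutex, the rest is illustrative:

	if (IS_ENABLED(CONFIG_LOCKDEP)) {
		fs_reclaim_acquire(GFP_KERNEL);
		mutex_lock(&pfdev->shrinker_lock);
		mutex_unlock(&pfdev->shrinker_lock);
		fs_reclaim_release(GFP_KERNEL);
	}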

> > The problem with lockdep is that you have to trigger the locking
> > scenario to get a warning out of it. For example you obviously won't get
> > any warnings about the shrinker without triggering the shrinker (which
> > means memory pressure since we don't have the debugfs file to trigger it).
>
> Actually, you don't need debugfs. Writing to /proc/sys/vm/drop_caches
> will do it. Though maybe there's other code path scenarios that
> wouldn't cover.

Huh, didn't know, but it's a bit of a shotgun, plus it doesn't use
fs_reclaim shrinker annotations, which means you don't have lockdep
checks. I think at least, would need some deadlock and testing.
>
>
> > I have to admit I'm not 100% sure I've seen any lockdep warnings based
> > on buffer objects recently. I can trigger them based on jobs:
> >
> > -----8<------
> > [  265.764474] ======================================================
> > [  265.771380] WARNING: possible circular locking dependency detected
> > [  265.778294] 5.11.0-rc2+ #22 Tainted: G        W
> > [  265.784148] ------------------------------------------------------
> > [  265.791050] kworker/0:3/90 is trying to acquire lock:
> > [  265.796694] c0d982b0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x38
> > [  265.804994]
> > [  265.804994] but task is already holding lock:
> > [  265.811513] c49a348c (&js->queue[j].lock){+.+.}-{3:3}, at: panfrost_reset+0x124/0x1cc [panfrost]
> > [  265.821375]
> > [  265.821375] which lock already depends on the new lock.
> > [  265.821375]
> > [  265.830524]
> > [  265.830524] the existing dependency chain (in reverse order) is:
> > [  265.838892]
> > [  265.838892] -> #2 (&js->queue[j].lock){+.+.}-{3:3}:
> > [  265.845996]        mutex_lock_nested+0x18/0x20
> > [  265.850961]        panfrost_scheduler_stop+0x1c/0x94 [panfrost]
> > [  265.857590]        panfrost_reset+0x54/0x1cc [panfrost]
> > [  265.863441]        process_one_work+0x238/0x51c
> > [  265.868503]        worker_thread+0x22c/0x2e0
> > [  265.873270]        kthread+0x128/0x138
> > [  265.877455]        ret_from_fork+0x14/0x38
> > [  265.882028]        0x0
> > [  265.884657]
> > [  265.884657] -> #1 (dma_fence_map){++++}-{0:0}:
> > [  265.891277]        dma_resv_lockdep+0x1b4/0x290
> > [  265.896339]        do_one_initcall+0x5c/0x2e8
> > [  265.901206]        kernel_init_freeable+0x184/0x1d4
> > [  265.906651]        kernel_init+0x8/0x11c
> > [  265.911029]        ret_from_fork+0x14/0x38
> > [  265.915610]        0x0
> > [  265.918247]
> > [  265.918247] -> #0 (fs_reclaim){+.+.}-{0:0}:
> > [  265.924579]        lock_acquire+0x3a4/0x45c
> > [  265.929260]        __fs_reclaim_acquire+0x28/0x38
> > [  265.934523]        slab_pre_alloc_hook.constprop.28+0x1c/0x64
> > [  265.940948]        kmem_cache_alloc_trace+0x38/0x114
> > [  265.946493]        panfrost_job_run+0x60/0x2b4 [panfrost]
> > [  265.952540]        drm_sched_resubmit_jobs+0x88/0xc4 [gpu_sched]
> > [  265.959256]        panfrost_reset+0x174/0x1cc [panfrost]
> > [  265.965196]        process_one_work+0x238/0x51c
> > [  265.970259]        worker_thread+0x22c/0x2e0
> > [  265.975025]        kthread+0x128/0x138
> > [  265.979210]        ret_from_fork+0x14/0x38
> > [  265.983784]        0x0
> > [  265.986412]
> > [  265.986412] other info that might help us debug this:
> > [  265.986412]
> > [  265.995353] Chain exists of:
> > [  265.995353]   fs_reclaim --> dma_fence_map --> &js->queue[j].lock
> > [  265.995353]
> > [  266.007028]  Possible unsafe locking scenario:
> > [  266.007028]
> > [  266.013638]        CPU0                    CPU1
> > [  266.018694]        ----                    ----
> > [  266.023750]   lock(&js->queue[j].lock);
> > [  266.028033]                                lock(dma_fence_map);
> > [  266.034648]                                lock(&js->queue[j].lock);
> > [  266.041746]   lock(fs_reclaim);
> > [  266.045252]
> > [  266.045252]  *** DEADLOCK ***
> > [  266.045252]
> > [  266.051861] 4 locks held by kworker/0:3/90:
> > [  266.056530]  #0: c18096a8 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x18c/0x51c
> > [  266.066258]  #1: c46d5f28 ((work_completion)(&pfdev->reset.work)){+.+.}-{0:0}, at: process_one_work+0x18c/0x51c
> > [  266.077546]  #2: c0dfc5a0 (dma_fence_map){++++}-{0:0}, at: panfrost_reset+0x10/0x1cc [panfrost]
> > [  266.087294]  #3: c49a348c (&js->queue[j].lock){+.+.}-{3:3}, at: panfrost_reset+0x124/0x1cc [panfrost]
> > [  266.097624]
> > [  266.097624] stack backtrace:
> > [  266.102487] CPU: 0 PID: 90 Comm: kworker/0:3 Tainted: G        W         5.11.0-rc2+ #22
> > [  266.111529] Hardware name: Rockchip (Device Tree)
> > [  266.116773] Workqueue: events panfrost_reset [panfrost]
> > [  266.122628] [<c010f0c0>] (unwind_backtrace) from [<c010ad38>] (show_stack+0x10/0x14)
> > [  266.131288] [<c010ad38>] (show_stack) from [<c07c3c28>] (dump_stack+0xa4/0xd0)
> > [  266.139363] [<c07c3c28>] (dump_stack) from [<c0168760>] (check_noncircular+0x6c/0x90)
> > [  266.148116] [<c0168760>] (check_noncircular) from [<c016a2c0>] (__lock_acquire+0xe90/0x16a0)
> > [  266.157549] [<c016a2c0>] (__lock_acquire) from [<c016b5f8>] (lock_acquire+0x3a4/0x45c)
> > [  266.166399] [<c016b5f8>] (lock_acquire) from [<c0247b98>] (__fs_reclaim_acquire+0x28/0x38)
> > [  266.175637] [<c0247b98>] (__fs_reclaim_acquire) from [<c025445c>] (slab_pre_alloc_hook.constprop.28+0x1c/0x64)
> > [  266.186820] [<c025445c>] (slab_pre_alloc_hook.constprop.28) from [<c0255714>] (kmem_cache_alloc_trace+0x38/0x114)
> > [  266.198292] [<c0255714>] (kmem_cache_alloc_trace) from [<bf00d28c>] (panfrost_job_run+0x60/0x2b4 [panfrost])
> > [  266.209295] [<bf00d28c>] (panfrost_job_run [panfrost]) from [<bf00102c>] (drm_sched_resubmit_jobs+0x88/0xc4 [gpu_sched])
> > [  266.221476] [<bf00102c>] (drm_sched_resubmit_jobs [gpu_sched]) from [<bf00d0a0>] (panfrost_reset+0x174/0x1cc [panfrost])
> > [  266.233649] [<bf00d0a0>] (panfrost_reset [panfrost]) from [<c0139fd4>] (process_one_work+0x238/0x51c)
> > [  266.243974] [<c0139fd4>] (process_one_work) from [<c013aaa0>] (worker_thread+0x22c/0x2e0)
> > [  266.253115] [<c013aaa0>] (worker_thread) from [<c013fd64>] (kthread+0x128/0x138)
> > [  266.261382] [<c013fd64>] (kthread) from [<c010011c>] (ret_from_fork+0x14/0x38)
> > [  266.269456] Exception stack(0xc46d5fb0 to 0xc46d5ff8)
> > [  266.275098] 5fa0:                                     00000000 00000000 00000000 00000000
> > [  266.284236] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> > [  266.293373] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> > -----8<------
> >
> > And indeed a sleeping in atomic bug:
> > -----8<-----
> > [  289.672380] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:935
> > [  289.681210] rockchip-vop ff930000.vop: [drm:vop_crtc_atomic_flush] *ERROR* VOP vblank IRQ stuck for 10 ms
> > [  289.681880] in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 31, name: kworker/0:1
> > [  289.701901] INFO: lockdep is turned off.
> > [  289.706339] irq event stamp: 9414
> > [  289.710095] hardirqs last  enabled at (9413): [<c07cd1cc>] _raw_spin_unlock_irq+0x20/0x40
> > [  289.719381] hardirqs last disabled at (9414): [<c07c9140>] __schedule+0x170/0x698
> > [  289.727855] softirqs last  enabled at (9112): [<c077379c>] reg_todo+0x64/0x51c
> > [  289.736044] softirqs last disabled at (9110): [<c077377c>] reg_todo+0x44/0x51c
> > [  289.744233] CPU: 0 PID: 31 Comm: kworker/0:1 Tainted: G        W         5.11.0-rc2+ #22
> > [  289.753375] Hardware name: Rockchip (Device Tree)
> > [  289.758711] Workqueue: events panfrost_reset [panfrost]
> > [  289.764886] [<c010f0c0>] (unwind_backtrace) from [<c010ad38>] (show_stack+0x10/0x14)
> > [  289.773721] [<c010ad38>] (show_stack) from [<c07c3c28>] (dump_stack+0xa4/0xd0)
> > [  289.781948] [<c07c3c28>] (dump_stack) from [<c01490f8>] (___might_sleep+0x1bc/0x244)
> > [  289.790761] [<c01490f8>] (___might_sleep) from [<c07ca720>] (__mutex_lock+0x34/0x3c4)
> > [  289.799654] [<c07ca720>] (__mutex_lock) from [<c07caac8>] (mutex_lock_nested+0x18/0x20)
> > [  289.808732] [<c07caac8>] (mutex_lock_nested) from [<bf00b704>] (panfrost_gem_free_object+0x20/0x100 [panfrost])
> > [  289.820358] [<bf00b704>] (panfrost_gem_free_object [panfrost]) from [<bf00ccb0>] (kref_put+0x38/0x5c [panfrost])
> > [  289.832247] [<bf00ccb0>] (kref_put [panfrost]) from [<bf00ce0c>] (panfrost_job_cleanup+0x120/0x140 [panfrost])
> > [  289.843936] [<bf00ce0c>] (panfrost_job_cleanup [panfrost]) from [<bf00ccb0>] (kref_put+0x38/0x5c [panfrost])
> > [  289.855435] [<bf00ccb0>] (kref_put [panfrost]) from [<c054acb8>] (dma_fence_signal_timestamp_locked+0x1c0/0x1f8)
> > [  289.867163] [<c054acb8>] (dma_fence_signal_timestamp_locked) from [<c054b1f8>] (dma_fence_signal+0x38/0x58)
> > [  289.878207] [<c054b1f8>] (dma_fence_signal) from [<bf001388>] (drm_sched_job_done+0x58/0x148 [gpu_sched])
> > [  289.889237] [<bf001388>] (drm_sched_job_done [gpu_sched]) from [<bf001524>] (drm_sched_start+0xa4/0xd0 [gpu_sched])
> > [  289.901389] [<bf001524>] (drm_sched_start [gpu_sched]) from [<bf00d0ac>] (panfrost_reset+0x180/0x1cc [panfrost])
> > [  289.913286] [<bf00d0ac>] (panfrost_reset [panfrost]) from [<c0139fd4>] (process_one_work+0x238/0x51c)
> > [  289.923970] [<c0139fd4>] (process_one_work) from [<c013aaa0>] (worker_thread+0x22c/0x2e0)
> > [  289.933257] [<c013aaa0>] (worker_thread) from [<c013fd64>] (kthread+0x128/0x138)
> > [  289.941661] [<c013fd64>] (kthread) from [<c010011c>] (ret_from_fork+0x14/0x38)
> > [  289.949867] Exception stack(0xc2243fb0 to 0xc2243ff8)
> > [  289.955604] 3fa0:                                     00000000 00000000 00000000 00000000
> > [  289.964840] 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> > [  289.974066] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> > -----8<----
> >
> > Certainly here the mutex causing the problem is the shrinker_lock!
> >
> > The above is triggered by chucking a whole ton of jobs which
> > fault at the GPU.
> >
> > Sadly I haven't found time to work out how to untangle the locks.
>
> They are tricky because pretty much any memory allocation can trigger
> things as I recall.

The above should only be possible with my dma_fence annotations, and
yes they point to bugs in the drm/scheduler. They shouldn't matter for
panfrost, and those patches aren't in upstream yet.
-Daniel
Steven Price Feb. 19, 2021, 1:36 p.m. UTC | #10
On 18/02/2021 18:20, Daniel Vetter wrote:
> On Thu, Feb 18, 2021 at 6:16 PM Rob Herring <robh+dt@kernel.org> wrote:
>>
>> On Thu, Feb 18, 2021 at 10:51 AM Steven Price <steven.price@arm.com> wrote:
>>>
>>> On 18/02/2021 16:38, Rob Herring wrote:
>>>> On Thu, Feb 18, 2021 at 10:15 AM Steven Price <steven.price@arm.com> wrote:
>>>>>
>>>>> On 18/02/2021 15:45, Alyssa Rosenzweig wrote:
>>>>>>> Yeah plus Cc: stable for backporting and I think an igt or similar for
>>>>>>> panfrost to check this works correctly would be pretty good too. Since
>>>>>>> if it took us over 1 year to notice this bug it's pretty clear that
>>>>>>> normal testing doesn't catch this. So very likely we'll break this
>>>>>>> again.
>>>>>>
>>>>>> Unfortunately there are a lot of kernel bugs which are noticed during actual
>>>>>> use (but not CI runs), some of which have never been fixed. I do know
>>>>>> the shrinker impl is buggy for us, if this is the fix I'm very happy.
>>>>>
>>>>> I doubt this will actually "fix" anything - if I understand correctly
>>>>> then the sequence which is broken is:
>>>>>
>>>>>     * allocate BO, mmap to CPU
>>>>>     * madvise(DONTNEED)
>>>>>     * trigger purge
>>>>>     * try to access the BO memory
>>>>>
>>>>> which is an invalid sequence for user space - the attempt to access
>>>>> memory should cause a SIGSEGV. However because drm_vma_node_unmap() is
>>>>> unable to find the mappings there may still be page table entries
>>>>> present which would provide access to memory the kernel has freed. Which
>>>>> is of course a big security hole and so this fix is needed.
>>>>>
>>>>> In what way do you find the shrinker impl buggy? I'm aware there's some
>>>>> dodgy locking (although I haven't worked out how to fix it) - but AFAICT
>>>>> it's more deadlock territory rather than lacking in locks. Are there
>>>>> correctness issues?
>>>>
>>>> What's there was largely a result of getting lockdep happy.
>>>>
>>>>>>> btw for testing shrinkers recommended way is to have a debugfs file
>>>>>>> that just force-shrinks everything. That way you avoid all the trouble
>>>>>>> that tend to happen when you drive a system close to OOM on linux, and
>>>>>>> it's also much faster.
>>>>>>
>>>>>> 2nding this as a good idea.
>>>>>>
>>>>>
>>>>> Sounds like a good idea to me too. But equally I'm wondering whether the
>>>>> best (short term) solution is to actually disable the shrinker. I'm
>>>>> somewhat surprised that nobody has got fed up with the "Purging xxx
>>>>> bytes" message spam - which makes me think that most people are not
>>>>> hitting memory pressure to trigger the shrinker.
>>>>
>>>> If the shrinker is dodgy, then it's probably good to have the messages
>>>> to know if it ran.
>>>>
>>>>> The shrinker on kbase caused a lot of grief - and the only way I managed
>>>>> to get that under control was by writing a static analysis tool for the
>>>>> locking, and by upsetting people by enforcing the (rather dumb) rules of
>>>>> the tool on the code base. I've been meaning to look at whether sparse
>>>>> can do a similar check of locks.
>>>>
>>>> Lockdep doesn't cover it?
>>>
>>> Short answer: no ;)
> 
> It's pretty good actually, if you correctly annotate things up.

I agree - it's pretty good, the problem is you need reasonable test 
coverage, and getting good test coverage of shrinkers is hard.

>>> The problem with lockdep is that you have to trigger the locking
>>> scenario to get a warning out of it. For example you obviously won't get
>>> any warnings about the shrinker without triggering the shrinker (which
>>> means memory pressure since we don't have the debugfs file to trigger it).
>>
>> Actually, you don't need debugfs. Writing to /proc/sys/vm/drop_caches
>> will do it. Though maybe there's other code path scenarios that
>> wouldn't cover.
> 
> Huh didn't know, but it's a bit a shotgun, plus it doesn't use
> fs_reclaim shrinker annotations, which means you don't have lockdep
> checks. I think at least, would need some deadlock and testing.

The big problem with this sort of method for triggering the shrinkers is 
that they are called without (many) locks held, whereas it's entirely 
possible for a shrinker to be called at (almost) any allocation in the 
kernel.

Admittedly the Panfrost shrinkers are fairly safe, because most things 
are xxx_trylock(). kbase avoids trylock, which makes reclaim more 
reliable but means deadlocks are much easier to hit.
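
The trylock pattern looks roughly like the sketch below (illustrative 
only - the drv_* names and the lock/list fields are made up, the shmem 
helper calls are the existing ones):

static unsigned long drv_shrinker_scan(struct shrinker *shrinker,
                                       struct shrink_control *sc)
{
        struct drv_device *dev = container_of(shrinker, struct drv_device,
                                              shrinker);
        struct drm_gem_shmem_object *shmem, *tmp;
        unsigned long freed = 0;

        /*
         * Reclaim can be entered from almost any allocation site, so
         * never block on a lock here - back off and report no progress.
         */
        if (!mutex_trylock(&dev->shrinker_lock))
                return SHRINK_STOP;

        list_for_each_entry_safe(shmem, tmp, &dev->shrinker_list, madv_list) {
                if (freed >= sc->nr_to_scan)
                        break;
                /* drm_gem_shmem_purge() trylocks pages_lock internally */
                if (drm_gem_shmem_is_purgeable(shmem) &&
                    drm_gem_shmem_purge(&shmem->base)) {
                        freed += shmem->base.size >> PAGE_SHIFT;
                        list_del_init(&shmem->madv_list);
                }
        }

        mutex_unlock(&dev->shrinker_lock);

        return freed;
}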

>>
>>> I have to admit I'm not 100% sure I've seen any lockdep warnings based
>>> on buffer objects recently. I can trigger them based on jobs:
>>>
[snip]
>>>
>>> Certainly here the mutex causing the problem is the shrinker_lock!
>>>
>>> The above is triggered by chucking a whole ton of jobs which
>>> fault at the GPU.
>>>
>>> Sadly I haven't found time to work out how to untangle the locks.
>>
>> They are tricky because pretty much any memory allocation can trigger
>> things as I recall.
> 
> The above should only be possible with my dma_fence annotations, and
> yes the point to bugs in the drm/scheduler. They shouldn't matter for
> panfrost, and those patches aren't in upstream yet.

Yes that's on a (random version of) drm-misc - just what I happened to 
have built recently. Good news if that's not actually Panfrost's bug. I 
haven't had the time to track down what's going on yet.

Sounds like I'm perhaps being a bit unfair on the shrinkers - I'm just 
aware that I went down a rabbit hole before, looking at changing the 
locks, which started because I was convinced that having shrinker_lock 
as a mutex was a problem. But it was a while ago and I've forgotten what 
the logic was.

Steve
Daniel Vetter Feb. 19, 2021, 3:13 p.m. UTC | #11
On Fri, Feb 19, 2021 at 01:36:06PM +0000, Steven Price wrote:
> On 18/02/2021 18:20, Daniel Vetter wrote:
> > On Thu, Feb 18, 2021 at 6:16 PM Rob Herring <robh+dt@kernel.org> wrote:
> > > 
> > > On Thu, Feb 18, 2021 at 10:51 AM Steven Price <steven.price@arm.com> wrote:
> > > > 
> > > > On 18/02/2021 16:38, Rob Herring wrote:
> > > > > On Thu, Feb 18, 2021 at 10:15 AM Steven Price <steven.price@arm.com> wrote:
> > > > > > 
> > > > > > On 18/02/2021 15:45, Alyssa Rosenzweig wrote:
> > > > > > > > Yeah plus Cc: stable for backporting and I think an igt or similar for
> > > > > > > > panfrost to check this works correctly would be pretty good too. Since
> > > > > > > > if it took us over 1 year to notice this bug it's pretty clear that
> > > > > > > > normal testing doesn't catch this. So very likely we'll break this
> > > > > > > > again.
> > > > > > > 
> > > > > > > Unfortunately there are a lot of kernel bugs which are noticed during actual
> > > > > > > use (but not CI runs), some of which have never been fixed. I do know
> > > > > > > the shrinker impl is buggy for us, if this is the fix I'm very happy.
> > > > > > 
> > > > > > I doubt this will actually "fix" anything - if I understand correctly
> > > > > > then the sequence which is broken is:
> > > > > > 
> > > > > >     * allocate BO, mmap to CPU
> > > > > >     * madvise(DONTNEED)
> > > > > >     * trigger purge
> > > > > >     * try to access the BO memory
> > > > > > 
> > > > > > which is an invalid sequence for user space - the attempt to access
> > > > > > memory should cause a SIGSEGV. However because drm_vma_node_unmap() is
> > > > > > unable to find the mappings there may still be page table entries
> > > > > > present which would provide access to memory the kernel has freed. Which
> > > > > > is of course a big security hole and so this fix is needed.
> > > > > > 
> > > > > > In what way do you find the shrinker impl buggy? I'm aware there's some
> > > > > > dodgy locking (although I haven't worked out how to fix it) - but AFAICT
> > > > > > it's more deadlock territory rather than lacking in locks. Are there
> > > > > > correctness issues?
> > > > > 
> > > > > What's there was largely a result of getting lockdep happy.
> > > > > 
> > > > > > > > btw for testing shrinkers recommended way is to have a debugfs file
> > > > > > > > that just force-shrinks everything. That way you avoid all the trouble
> > > > > > > > that tend to happen when you drive a system close to OOM on linux, and
> > > > > > > > it's also much faster.
> > > > > > > 
> > > > > > > 2nding this as a good idea.
> > > > > > > 
> > > > > > 
> > > > > > Sounds like a good idea to me too. But equally I'm wondering whether the
> > > > > > best (short term) solution is to actually disable the shrinker. I'm
> > > > > > somewhat surprised that nobody has got fed up with the "Purging xxx
> > > > > > bytes" message spam - which makes me think that most people are not
> > > > > > hitting memory pressure to trigger the shrinker.
> > > > > 
> > > > > If the shrinker is dodgy, then it's probably good to have the messages
> > > > > to know if it ran.
> > > > > 
> > > > > > The shrinker on kbase caused a lot of grief - and the only way I managed
> > > > > > to get that under control was by writing a static analysis tool for the
> > > > > > locking, and by upsetting people by enforcing the (rather dumb) rules of
> > > > > > the tool on the code base. I've been meaning to look at whether sparse
> > > > > > can do a similar check of locks.
> > > > > 
> > > > > Lockdep doesn't cover it?
> > > > 
> > > > Short answer: no ;)
> > 
> > It's pretty good actually, if you correctly annotate things up.
> 
> I agree - it's pretty good, the problem is you need reasonable test
> coverage, and getting good test coverage of shrinkers is hard.
> 
> > > > The problem with lockdep is that you have to trigger the locking
> > > > scenario to get a warning out of it. For example you obviously won't get
> > > > any warnings about the shrinker without triggering the shrinker (which
> > > > means memory pressure since we don't have the debugfs file to trigger it).
> > > 
> > > Actually, you don't need debugfs. Writing to /proc/sys/vm/drop_caches
> > > will do it. Though maybe there's other code path scenarios that
> > > wouldn't cover.
> > 
> > Huh didn't know, but it's a bit a shotgun, plus it doesn't use
> > fs_reclaim shrinker annotations, which means you don't have lockdep
> > checks. I think at least, would need some deadlock and testing.
> 
> The big problem with this sort of method for triggering the shrinkers is
> that they are called without (many) locks held. Whereas it's entirely
> possible for a shrinker to be called at (almost) any allocation in the
> kernel.
> 
> Admittedly the Panfrost shrinkers are fairly safe - because most things are
> xxx_trylock(). kbase avoids trylock which makes reclaim more reliable, but
> means deadlocks are much easier.

This is why you need the fs_reclaim annotation. With that lockdep can
connect the dots. See also might_alloc() annotations I've added in 5.11 or
so.

Validating shrinkers for deadlocks is actually not that hard, you just
need the debugfs interface to run your shrinker at will under the
fs_reclaim_acquire/release annotations. You do _not_ need to hit the full
combinatorial test matrix of making sure that your shrinker is called in
any possible place where memory is allocated.
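
Something like this is all it takes (rough sketch, names made up; wire
the fops up with debugfs_create_file() and point it at your shrinker's
scan callback):

static int drv_shrink_all_show(struct seq_file *m, void *data)
{
        struct drv_device *dev = m->private;
        struct shrink_control sc = {
                .gfp_mask = GFP_KERNEL,
                .nr_to_scan = LONG_MAX,
        };
        unsigned long freed;

        /*
         * Pretend we are in reclaim so lockdep can connect locks taken
         * in the scan callback with locks held around allocations.
         */
        fs_reclaim_acquire(GFP_KERNEL);
        freed = drv_shrinker_scan(&dev->shrinker, &sc);
        fs_reclaim_release(GFP_KERNEL);

        seq_printf(m, "%lu pages freed\n", freed);

        return 0;
}
DEFINE_SHOW_ATTRIBUTE(drv_shrink_all);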

> > > > I have to admit I'm not 100% sure I've seen any lockdep warnings based
> > > > on buffer objects recently. I can trigger them based on jobs:
> > > > 
> [snip]
> > > > 
> > > > Certainly here the mutex causing the problem is the shrinker_lock!
> > > > 
> > > > The above is triggered by chucking a whole ton of jobs which
> > > > fault at the GPU.
> > > > 
> > > > Sadly I haven't found time to work out how to untangle the locks.
> > > 
> > > They are tricky because pretty much any memory allocation can trigger
> > > things as I recall.
> > 
> > The above should only be possible with my dma_fence annotations, and
> > yes the point to bugs in the drm/scheduler. They shouldn't matter for
> > panfrost, and those patches aren't in upstream yet.
> 
> Yes that's on a (random version of) drm-misc - just what I happened to have
> built recently. Good news if that's not actually Panfrost's bug. I haven't
> had the time to track down what's going on yet.

Are you sure this is really drm-misc? The patch which adds these
annotations should never have been there.

Also help for fixing common code is appreciated :-)

> Sounds like I'm perhaps being a bit unfair on the shrinkers - I'm just aware
> that I went down a rabbit hole before looking at changing the locks which
> started because I was convinced having shrinker_lock as a mutex was a
> problem. But it was a while ago and I've forgotten what the logic was.

Oh memory reclaim and shrinker is definitely a rabbit hole.
-Daniel
Steven Price Feb. 19, 2021, 4:18 p.m. UTC | #12
On 19/02/2021 15:13, Daniel Vetter wrote:
> On Fri, Feb 19, 2021 at 01:36:06PM +0000, Steven Price wrote:
>> On 18/02/2021 18:20, Daniel Vetter wrote:
>>> On Thu, Feb 18, 2021 at 6:16 PM Rob Herring <robh+dt@kernel.org> wrote:
>>>>
>>>> On Thu, Feb 18, 2021 at 10:51 AM Steven Price <steven.price@arm.com> wrote:
>>>>>
>>>>> On 18/02/2021 16:38, Rob Herring wrote:
>>>>>> On Thu, Feb 18, 2021 at 10:15 AM Steven Price <steven.price@arm.com> wrote:
>>>>>>>
>>>>>>> On 18/02/2021 15:45, Alyssa Rosenzweig wrote:
>>>>>>>>> Yeah plus Cc: stable for backporting and I think an igt or similar for
>>>>>>>>> panfrost to check this works correctly would be pretty good too. Since
>>>>>>>>> if it took us over 1 year to notice this bug it's pretty clear that
>>>>>>>>> normal testing doesn't catch this. So very likely we'll break this
>>>>>>>>> again.
>>>>>>>>
>>>>>>>> Unfortunately there are a lot of kernel bugs which are noticed during actual
>>>>>>>> use (but not CI runs), some of which have never been fixed. I do know
>>>>>>>> the shrinker impl is buggy for us, if this is the fix I'm very happy.
>>>>>>>
>>>>>>> I doubt this will actually "fix" anything - if I understand correctly
>>>>>>> then the sequence which is broken is:
>>>>>>>
>>>>>>>      * allocate BO, mmap to CPU
>>>>>>>      * madvise(DONTNEED)
>>>>>>>      * trigger purge
>>>>>>>      * try to access the BO memory
>>>>>>>
>>>>>>> which is an invalid sequence for user space - the attempt to access
>>>>>>> memory should cause a SIGSEGV. However because drm_vma_node_unmap() is
>>>>>>> unable to find the mappings there may still be page table entries
>>>>>>> present which would provide access to memory the kernel has freed. Which
>>>>>>> is of course a big security hole and so this fix is needed.
>>>>>>>
>>>>>>> In what way do you find the shrinker impl buggy? I'm aware there's some
>>>>>>> dodgy locking (although I haven't worked out how to fix it) - but AFAICT
>>>>>>> it's more deadlock territory rather than lacking in locks. Are there
>>>>>>> correctness issues?
>>>>>>
>>>>>> What's there was largely a result of getting lockdep happy.
>>>>>>
>>>>>>>>> btw for testing shrinkers recommended way is to have a debugfs file
>>>>>>>>> that just force-shrinks everything. That way you avoid all the trouble
>>>>>>>>> that tend to happen when you drive a system close to OOM on linux, and
>>>>>>>>> it's also much faster.
>>>>>>>>
>>>>>>>> 2nding this as a good idea.
>>>>>>>>
>>>>>>>
>>>>>>> Sounds like a good idea to me too. But equally I'm wondering whether the
>>>>>>> best (short term) solution is to actually disable the shrinker. I'm
>>>>>>> somewhat surprised that nobody has got fed up with the "Purging xxx
>>>>>>> bytes" message spam - which makes me think that most people are not
>>>>>>> hitting memory pressure to trigger the shrinker.
>>>>>>
>>>>>> If the shrinker is dodgy, then it's probably good to have the messages
>>>>>> to know if it ran.
>>>>>>
>>>>>>> The shrinker on kbase caused a lot of grief - and the only way I managed
>>>>>>> to get that under control was by writing a static analysis tool for the
>>>>>>> locking, and by upsetting people by enforcing the (rather dumb) rules of
>>>>>>> the tool on the code base. I've been meaning to look at whether sparse
>>>>>>> can do a similar check of locks.
>>>>>>
>>>>>> Lockdep doesn't cover it?
>>>>>
>>>>> Short answer: no ;)
>>>
>>> It's pretty good actually, if you correctly annotate things up.
>>
>> I agree - it's pretty good, the problem is you need reasonable test
>> coverage, and getting good test coverage of shrinkers is hard.
>>
>>>>> The problem with lockdep is that you have to trigger the locking
>>>>> scenario to get a warning out of it. For example you obviously won't get
>>>>> any warnings about the shrinker without triggering the shrinker (which
>>>>> means memory pressure since we don't have the debugfs file to trigger it).
>>>>
>>>> Actually, you don't need debugfs. Writing to /proc/sys/vm/drop_caches
>>>> will do it. Though maybe there's other code path scenarios that
>>>> wouldn't cover.
>>>
>>> Huh didn't know, but it's a bit a shotgun, plus it doesn't use
>>> fs_reclaim shrinker annotations, which means you don't have lockdep
>>> checks. I think at least, would need some deadlock and testing.
>>
>> The big problem with this sort of method for triggering the shrinkers is
>> that they are called without (many) locks held. Whereas it's entirely
>> possible for a shrinker to be called at (almost) any allocation in the
>> kernel.
>>
>> Admittedly the Panfrost shrinkers are fairly safe - because most things are
>> xxx_trylock(). kbase avoids trylock which makes reclaim more reliable, but
>> means deadlocks are much easier.
> 
> This is why you need the fs_reclaim annotation. With that lockdep can
> connect the dots. See also might_alloc() annotations I've added in 5.11 or
> so.
> 
> Validating shrinkers for deadlocks is actually not that hard, you just
> need the debugfs interface to run your shrinker at will under the
> fs_reclaim_acquire/release annotations. You do _not_ need to hit the full
> combinatorial test matrix of making sure that your shrinker is called in
> any possible place where memory is allocated.

Cool - I hadn't looked at that code before, but it does look like it 
should pick up the problem cases. I wish that had existed back when I 
was dealing with kbase! :)

>>>>> I have to admit I'm not 100% sure I've seen any lockdep warnings based
>>>>> on buffer objects recently. I can trigger them based on jobs:
>>>>>
>> [snip]
>>>>>
>>>>> Certainly here the mutex causing the problem is the shrinker_lock!
>>>>>
>>>>> The above is triggered by chucking a whole ton of jobs which
>>>>> fault at the GPU.
>>>>>
>>>>> Sadly I haven't found time to work out how to untangle the locks.
>>>>
>>>> They are tricky because pretty much any memory allocation can trigger
>>>> things as I recall.
>>>
>>> The above should only be possible with my dma_fence annotations, and
>>> yes the point to bugs in the drm/scheduler. They shouldn't matter for
>>> panfrost, and those patches aren't in upstream yet.
>>
>> Yes that's on a (random version of) drm-misc - just what I happened to have
>> built recently. Good news if that's not actually Panfrost's bug. I haven't
>> had the time to track down what's going on yet.
> 
> Are you sure this is really drm-misc? The patch should never have been
> there which adds these annotations.

Well drm-misc-next. It's based on commit e4abd7ad2b77 with some pending 
Panfrost fixes on top.

> Also help for fixing common code is appreciated :-)

Indeed - I have put it on my todo list, but I'm not sure when I'll be 
able to spend time on it.

Thanks,

Steve

>> Sounds like I'm perhaps being a bit unfair on the shrinkers - I'm just aware
>> that I went down a rabbit hole before looking at changing the locks which
>> started because I was convinced having shrinker_lock as a mutex was a
>> problem. But it was a while ago and I've forgotten what the logic was.
> 
> Oh memory reclaim and shrinker is definitely a rabbit hole.
> -Daniel
>
Daniel Vetter Feb. 19, 2021, 5:45 p.m. UTC | #13
On Fri, Feb 19, 2021 at 5:17 PM Steven Price <steven.price@arm.com> wrote:
>
> On 19/02/2021 15:13, Daniel Vetter wrote:
> > On Fri, Feb 19, 2021 at 01:36:06PM +0000, Steven Price wrote:
> >> On 18/02/2021 18:20, Daniel Vetter wrote:
> >>> On Thu, Feb 18, 2021 at 6:16 PM Rob Herring <robh+dt@kernel.org> wrote:
> >>>>
> >>>> On Thu, Feb 18, 2021 at 10:51 AM Steven Price <steven.price@arm.com> wrote:
> >>>>>
> >>>>> On 18/02/2021 16:38, Rob Herring wrote:
> >>>>>> On Thu, Feb 18, 2021 at 10:15 AM Steven Price <steven.price@arm.com> wrote:
> >>>>>>>
> >>>>>>> On 18/02/2021 15:45, Alyssa Rosenzweig wrote:
> >>>>>>>>> Yeah plus Cc: stable for backporting and I think an igt or similar for
> >>>>>>>>> panfrost to check this works correctly would be pretty good too. Since
> >>>>>>>>> if it took us over 1 year to notice this bug it's pretty clear that
> >>>>>>>>> normal testing doesn't catch this. So very likely we'll break this
> >>>>>>>>> again.
> >>>>>>>>
> >>>>>>>> Unfortunately there are a lot of kernel bugs which are noticed during actual
> >>>>>>>> use (but not CI runs), some of which have never been fixed. I do know
> >>>>>>>> the shrinker impl is buggy for us, if this is the fix I'm very happy.
> >>>>>>>
> >>>>>>> I doubt this will actually "fix" anything - if I understand correctly
> >>>>>>> then the sequence which is broken is:
> >>>>>>>
> >>>>>>>      * allocate BO, mmap to CPU
> >>>>>>>      * madvise(DONTNEED)
> >>>>>>>      * trigger purge
> >>>>>>>      * try to access the BO memory
> >>>>>>>
> >>>>>>> which is an invalid sequence for user space - the attempt to access
> >>>>>>> memory should cause a SIGSEGV. However because drm_vma_node_unmap() is
> >>>>>>> unable to find the mappings there may still be page table entries
> >>>>>>> present which would provide access to memory the kernel has freed. Which
> >>>>>>> is of course a big security hole and so this fix is needed.
> >>>>>>>
> >>>>>>> In what way do you find the shrinker impl buggy? I'm aware there's some
> >>>>>>> dodgy locking (although I haven't worked out how to fix it) - but AFAICT
> >>>>>>> it's more deadlock territory rather than lacking in locks. Are there
> >>>>>>> correctness issues?
> >>>>>>
> >>>>>> What's there was largely a result of getting lockdep happy.
> >>>>>>
> >>>>>>>>> btw for testing shrinkers recommended way is to have a debugfs file
> >>>>>>>>> that just force-shrinks everything. That way you avoid all the trouble
> >>>>>>>>> that tend to happen when you drive a system close to OOM on linux, and
> >>>>>>>>> it's also much faster.
> >>>>>>>>
> >>>>>>>> 2nding this as a good idea.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Sounds like a good idea to me too. But equally I'm wondering whether the
> >>>>>>> best (short term) solution is to actually disable the shrinker. I'm
> >>>>>>> somewhat surprised that nobody has got fed up with the "Purging xxx
> >>>>>>> bytes" message spam - which makes me think that most people are not
> >>>>>>> hitting memory pressure to trigger the shrinker.
> >>>>>>
> >>>>>> If the shrinker is dodgy, then it's probably good to have the messages
> >>>>>> to know if it ran.
> >>>>>>
> >>>>>>> The shrinker on kbase caused a lot of grief - and the only way I managed
> >>>>>>> to get that under control was by writing a static analysis tool for the
> >>>>>>> locking, and by upsetting people by enforcing the (rather dumb) rules of
> >>>>>>> the tool on the code base. I've been meaning to look at whether sparse
> >>>>>>> can do a similar check of locks.
> >>>>>>
> >>>>>> Lockdep doesn't cover it?
> >>>>>
> >>>>> Short answer: no ;)
> >>>
> >>> It's pretty good actually, if you correctly annotate things up.
> >>
> >> I agree - it's pretty good, the problem is you need reasonable test
> >> coverage, and getting good test coverage of shrinkers is hard.
> >>
> >>>>> The problem with lockdep is that you have to trigger the locking
> >>>>> scenario to get a warning out of it. For example you obviously won't get
> >>>>> any warnings about the shrinker without triggering the shrinker (which
> >>>>> means memory pressure since we don't have the debugfs file to trigger it).
> >>>>
> >>>> Actually, you don't need debugfs. Writing to /proc/sys/vm/drop_caches
> >>>> will do it. Though maybe there's other code path scenarios that
> >>>> wouldn't cover.
> >>>
> >>> Huh didn't know, but it's a bit a shotgun, plus it doesn't use
> >>> fs_reclaim shrinker annotations, which means you don't have lockdep
> >>> checks. I think at least, would need some deadlock and testing.
> >>
> >> The big problem with this sort of method for triggering the shrinkers is
> >> that they are called without (many) locks held. Whereas it's entirely
> >> possible for a shrinker to be called at (almost) any allocation in the
> >> kernel.
> >>
> >> Admittedly the Panfrost shrinkers are fairly safe - because most things are
> >> xxx_trylock(). kbase avoids trylock which makes reclaim more reliable, but
> >> means deadlocks are much easier.
> >
> > This is why you need the fs_reclaim annotation. With that lockdep can
> > connect the dots. See also might_alloc() annotations I've added in 5.11 or
> > so.
> >
> > Validating shrinkers for deadlocks is actually not that hard, you just
> > need the debugfs interface to run your shrinker at will under the
> > fs_reclaim_acquire/release annotations. You do _not_ need to hit the full
> > combinatorial test matrix of making sure that your shrinker is called in
> > any possible place where memory is allocated.
>
> Cool - I hadn't looked at that code before, but it does look like it
> should pick up the problem cases. I wish that had existed back when I
> was dealing with kbase! :)
>
> >>>>> I have to admit I'm not 100% sure I've seen any lockdep warnings based
> >>>>> on buffer objects recently. I can trigger them based on jobs:
> >>>>>
> >> [snip]
> >>>>>
> >>>>> Certainly here the mutex causing the problem is the shrinker_lock!
> >>>>>
> >>>>> The above is triggered by chucking a whole ton of jobs which
> >>>>> fault at the GPU.
> >>>>>
> >>>>> Sadly I haven't found time to work out how to untangle the locks.
> >>>>
> >>>> They are tricky because pretty much any memory allocation can trigger
> >>>> things as I recall.
> >>>
> >>> The above should only be possible with my dma_fence annotations, and
> >>> yes the point to bugs in the drm/scheduler. They shouldn't matter for
> >>> panfrost, and those patches aren't in upstream yet.
> >>
> >> Yes that's on a (random version of) drm-misc - just what I happened to have
> >> built recently. Good news if that's not actually Panfrost's bug. I haven't
> >> had the time to track down what's going on yet.
> >
> > Are you sure this is really drm-misc? The patch should never have been
> > there which adds these annotations.
>
> Well drm-misc-next. It's based on commit e4abd7ad2b77 with some pending
> Panfrost fixes on top.
>
> > Also help for fixing common code is appreciated :-)
>
> Indeed - I have put it on my todo list, but I'm not sure when I'll be
> able to spend time on it.

Oh, there are indeed panfrost annotations in panfrost_reset(), but I
guess that stuff just wasn't ever really tested when it landed?

commit 5bc5cc2819c2c0adb644919e3e790b504ea47e0a
Author: Boris Brezillon <boris.brezillon@collabora.com>
Date:   Thu Nov 5 16:17:04 2020 +0100

   drm/panfrost: Move the GPU reset bits outside the timeout handler

Adding Boris, since my idea wasn't to just add more lockdep splats to
panfrost, but to help make sure the reset code works correctly.
-Daniel

>
> Thanks,
>
> Steve
>
> >> Sounds like I'm perhaps being a bit unfair on the shrinkers - I'm just aware
> >> that I went down a rabbit hole before looking at changing the locks which
> >> started because I was convinced having shrinker_lock as a mutex was a
> >> problem. But it was a while ago and I've forgotten what the logic was.
> >
> > Oh memory reclaim and shrinker is definitely a rabbit hole.
> > -Daniel
> >
>
Boris Brezillon Feb. 22, 2021, 8:34 a.m. UTC | #14
Hi Daniel,

On Fri, 19 Feb 2021 18:45:00 +0100
Daniel Vetter <daniel@ffwll.ch> wrote:

> On Fri, Feb 19, 2021 at 5:17 PM Steven Price <steven.price@arm.com> wrote:
> >
> > On 19/02/2021 15:13, Daniel Vetter wrote:  
> > > On Fri, Feb 19, 2021 at 01:36:06PM +0000, Steven Price wrote:  
> > >> On 18/02/2021 18:20, Daniel Vetter wrote:  
> > >>> On Thu, Feb 18, 2021 at 6:16 PM Rob Herring <robh+dt@kernel.org> wrote:  
> > >>>>
> > >>>> On Thu, Feb 18, 2021 at 10:51 AM Steven Price <steven.price@arm.com> wrote:  
> > >>>>>
> > >>>>> On 18/02/2021 16:38, Rob Herring wrote:  
> > >>>>>> On Thu, Feb 18, 2021 at 10:15 AM Steven Price <steven.price@arm.com> wrote:  
> > >>>>>>>
> > >>>>>>> On 18/02/2021 15:45, Alyssa Rosenzweig wrote:  
> > >>>>>>>>> Yeah plus Cc: stable for backporting and I think an igt or similar for
> > >>>>>>>>> panfrost to check this works correctly would be pretty good too. Since
> > >>>>>>>>> if it took us over 1 year to notice this bug it's pretty clear that
> > >>>>>>>>> normal testing doesn't catch this. So very likely we'll break this
> > >>>>>>>>> again.  
> > >>>>>>>>
> > >>>>>>>> Unfortunately there are a lot of kernel bugs which are noticed during actual
> > >>>>>>>> use (but not CI runs), some of which have never been fixed. I do know
> > >>>>>>>> the shrinker impl is buggy for us, if this is the fix I'm very happy.  
> > >>>>>>>
> > >>>>>>> I doubt this will actually "fix" anything - if I understand correctly
> > >>>>>>> then the sequence which is broken is:
> > >>>>>>>
> > >>>>>>>      * allocate BO, mmap to CPU
> > >>>>>>>      * madvise(DONTNEED)
> > >>>>>>>      * trigger purge
> > >>>>>>>      * try to access the BO memory
> > >>>>>>>
> > >>>>>>> which is an invalid sequence for user space - the attempt to access
> > >>>>>>> memory should cause a SIGSEGV. However because drm_vma_node_unmap() is
> > >>>>>>> unable to find the mappings there may still be page table entries
> > >>>>>>> present which would provide access to memory the kernel has freed. Which
> > >>>>>>> is of course a big security hole and so this fix is needed.
> > >>>>>>>
> > >>>>>>> In what way do you find the shrinker impl buggy? I'm aware there's some
> > >>>>>>> dodgy locking (although I haven't worked out how to fix it) - but AFAICT
> > >>>>>>> it's more deadlock territory rather than lacking in locks. Are there
> > >>>>>>> correctness issues?  
> > >>>>>>
> > >>>>>> What's there was largely a result of getting lockdep happy.
> > >>>>>>  
> > >>>>>>>>> btw for testing shrinkers recommended way is to have a debugfs file
> > >>>>>>>>> that just force-shrinks everything. That way you avoid all the trouble
> > >>>>>>>>> that tend to happen when you drive a system close to OOM on linux, and
> > >>>>>>>>> it's also much faster.  
> > >>>>>>>>
> > >>>>>>>> 2nding this as a good idea.
> > >>>>>>>>  
> > >>>>>>>
> > >>>>>>> Sounds like a good idea to me too. But equally I'm wondering whether the
> > >>>>>>> best (short term) solution is to actually disable the shrinker. I'm
> > >>>>>>> somewhat surprised that nobody has got fed up with the "Purging xxx
> > >>>>>>> bytes" message spam - which makes me think that most people are not
> > >>>>>>> hitting memory pressure to trigger the shrinker.  
> > >>>>>>
> > >>>>>> If the shrinker is dodgy, then it's probably good to have the messages
> > >>>>>> to know if it ran.
> > >>>>>>  
> > >>>>>>> The shrinker on kbase caused a lot of grief - and the only way I managed
> > >>>>>>> to get that under control was by writing a static analysis tool for the
> > >>>>>>> locking, and by upsetting people by enforcing the (rather dumb) rules of
> > >>>>>>> the tool on the code base. I've been meaning to look at whether sparse
> > >>>>>>> can do a similar check of locks.  
> > >>>>>>
> > >>>>>> Lockdep doesn't cover it?  
> > >>>>>
> > >>>>> Short answer: no ;)  
> > >>>
> > >>> It's pretty good actually, if you correctly annotate things up.  
> > >>
> > >> I agree - it's pretty good, the problem is you need reasonable test
> > >> coverage, and getting good test coverage of shrinkers is hard.
> > >>  
> > >>>>> The problem with lockdep is that you have to trigger the locking
> > >>>>> scenario to get a warning out of it. For example you obviously won't get
> > >>>>> any warnings about the shrinker without triggering the shrinker (which
> > >>>>> means memory pressure since we don't have the debugfs file to trigger it).  
> > >>>>
> > >>>> Actually, you don't need debugfs. Writing to /proc/sys/vm/drop_caches
> > >>>> will do it. Though maybe there's other code path scenarios that
> > >>>> wouldn't cover.  
> > >>>
> > >>> Huh didn't know, but it's a bit a shotgun, plus it doesn't use
> > >>> fs_reclaim shrinker annotations, which means you don't have lockdep
> > >>> checks. I think at least, would need some deadlock and testing.  
> > >>
> > >> The big problem with this sort of method for triggering the shrinkers is
> > >> that they are called without (many) locks held. Whereas it's entirely
> > >> possible for a shrinker to be called at (almost) any allocation in the
> > >> kernel.
> > >>
> > >> Admittedly the Panfrost shrinkers are fairly safe - because most things are
> > >> xxx_trylock(). kbase avoids trylock which makes reclaim more reliable, but
> > >> means deadlocks are much easier.  
> > >
> > > This is why you need the fs_reclaim annotation. With that lockdep can
> > > connect the dots. See also might_alloc() annotations I've added in 5.11 or
> > > so.
> > >
> > > Validating shrinkers for deadlocks is actually not that hard, you just
> > > need the debugfs interface to run your shrinker at will under the
> > > fs_reclaim_acquire/release annotations. You do _not_ need to hit the full
> > > combinatorial test matrix of making sure that your shrinker is called in
> > > any possible place where memory is allocated.  
> >
> > Cool - I hadn't looked at that code before, but it does look like it
> > should pick up the problem cases. I wish that had existed back when I
> > was dealing with kbase! :)
> >  
> > >>>>> I have to admit I'm not 100% sure I've seen any lockdep warnings based
> > >>>>> on buffer objects recently. I can trigger them based on jobs:
> > >>>>>  
> > >> [snip]  
> > >>>>>
> > >>>>> Certainly here the mutex causing the problem is the shrinker_lock!
> > >>>>>
> > >>>>> The above is triggered by chucking a whole ton of jobs which
> > >>>>> fault at the GPU.
> > >>>>>
> > >>>>> Sadly I haven't found time to work out how to untangle the locks.  
> > >>>>
> > >>>> They are tricky because pretty much any memory allocation can trigger
> > >>>> things as I recall.  
> > >>>
> > >>> The above should only be possible with my dma_fence annotations, and
> > >>> yes the point to bugs in the drm/scheduler. They shouldn't matter for
> > >>> panfrost, and those patches aren't in upstream yet.  
> > >>
> > >> Yes that's on a (random version of) drm-misc - just what I happened to have
> > >> built recently. Good news if that's not actually Panfrost's bug. I haven't
> > >> had the time to track down what's going on yet.  
> > >
> > > Are you sure this is really drm-misc? The patch should never have been
> > > there which adds these annotations.  
> >
> > Well drm-misc-next. It's based on commit e4abd7ad2b77 with some pending
> > Panfrost fixes on top.
> >  
> > > Also help for fixing common code is appreciated :-)  
> >
> > Indeed - I have put it on my todo list, but I'm not sure when I'll be
> > able to spend time on it.  
> 
> Oh there's indeed panfrost annotations in panfrost_reset(), but I
> guess that stuff just wasn't ever really tested when it landed?
> 
> commit 5bc5cc2819c2c0adb644919e3e790b504ea47e0a
> Author: Boris Brezillon <boris.brezillon@collabora.com>
> Date:   Thu Nov 5 16:17:04 2020 +0100
> 
>    drm/panfrost: Move the GPU reset bits outside the timeout handler
> 
> Adding Boris, since my idea wasn't to just add more lockdep splats to
> panfrost, but help make sure the reset code works correctly.

Okay, I'm a bit tired of this "oh, this is not really what I asked"
reply I got when you realize the code you acked/reviewed is not what
you expected. The dma_fence annotation stuff is far from trivial, and I
remember asking for help when I submitted the version adding those
annotations in the reset path. Note that I didn't intend to spend days
on that stuff, since all I was trying to do with this patch was to fix a
deadlock we had in the reset path - a deadlock that was not related to
dma_fence signaling at all. So yes, it was not thoroughly tested
(especially the fence signaling part), but it was tested enough to
validate that the workloads which triggered the deadlock I was trying
to fix no longer hit it.

Don't get me wrong, I'm all for improving the DRM subsystem, and this
fencing annotation is probably a great thing, but you can't expect
people to take on those complex tasks without a bit of supervision.

Regards,

Boris
Thomas Zimmermann Feb. 22, 2021, 2:24 p.m. UTC | #15
Hi

Am 17.02.21 um 17:59 schrieb Neil Roberts:
> When mmapping the shmem, it would previously adjust the pgoff in the
> vm_area_struct to remove the fake offset that is added to be able to
> identify the buffer. This patch removes the adjustment and makes the
> fault handler use the vm_fault address to calculate the page offset
> instead. Although using this address is apparently discouraged, several
> DRM drivers seem to be doing it anyway.
> 
> The problem with removing the pgoff is that it prevents
> drm_vma_node_unmap from working because that searches the mapping tree
> by address. That doesn't work because all of the mappings are at offset
> 0. drm_vma_node_unmap is being used by the shmem helpers when purging
> the buffer.

I just want to ask if this is how the mmap callbacks are supposed to 
work now?

I have a number of patches in here that convert several drivers to the 
GEM object's mmap callback. They might not be affected by the vm_pgoff 
bug, but I'd like to submit something that could work in the longer term.

Best regards
Thomas

> 
> It looks like panfrost is using drm_gem_shmem_purge so this might fix a
> potential bug there.
> 
> Signed-off-by: Neil Roberts <nroberts@igalia.com>
> ---
>   drivers/gpu/drm/drm_gem_shmem_helper.c | 12 +++++++-----
>   1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c b/drivers/gpu/drm/drm_gem_shmem_helper.c
> index 9825c378dfa6..4b14157f1962 100644
> --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> @@ -526,11 +526,16 @@ static vm_fault_t drm_gem_shmem_fault(struct vm_fault *vmf)
>   	struct drm_gem_shmem_object *shmem = to_drm_gem_shmem_obj(obj);
>   	loff_t num_pages = obj->size >> PAGE_SHIFT;
>   	struct page *page;
> +	pgoff_t page_offset;
>   
> -	if (vmf->pgoff >= num_pages || WARN_ON_ONCE(!shmem->pages))
> +	/* We don't use vmf->pgoff since that has the fake offset */
> +	page_offset = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> +
> +	if (page_offset < 0 || page_offset >= num_pages ||
> +	    WARN_ON_ONCE(!shmem->pages))
>   		return VM_FAULT_SIGBUS;
>   
> -	page = shmem->pages[vmf->pgoff];
> +	page = shmem->pages[page_offset];
>   
>   	return vmf_insert_page(vma, vmf->address, page);
>   }
> @@ -581,9 +586,6 @@ int drm_gem_shmem_mmap(struct drm_gem_object *obj, struct vm_area_struct *vma)
>   	struct drm_gem_shmem_object *shmem;
>   	int ret;
>   
> -	/* Remove the fake offset */
> -	vma->vm_pgoff -= drm_vma_node_start(&obj->vma_node);
> -
>   	if (obj->import_attach) {
>   		/* Drop the reference drm_gem_mmap_obj() acquired.*/
>   		drm_gem_object_put(obj);
>
Daniel Vetter Feb. 22, 2021, 4:20 p.m. UTC | #16
On Mon, Feb 22, 2021 at 09:34:43AM +0100, Boris Brezillon wrote:
> Hi Daniel,
> 
> On Fri, 19 Feb 2021 18:45:00 +0100
> Daniel Vetter <daniel@ffwll.ch> wrote:
> 
> > On Fri, Feb 19, 2021 at 5:17 PM Steven Price <steven.price@arm.com> wrote:
> > >
> > > On 19/02/2021 15:13, Daniel Vetter wrote:  
> > > > On Fri, Feb 19, 2021 at 01:36:06PM +0000, Steven Price wrote:  
> > > >> On 18/02/2021 18:20, Daniel Vetter wrote:  
> > > >>> On Thu, Feb 18, 2021 at 6:16 PM Rob Herring <robh+dt@kernel.org> wrote:  
> > > >>>>
> > > >>>> On Thu, Feb 18, 2021 at 10:51 AM Steven Price <steven.price@arm.com> wrote:  
> > > >>>>>
> > > >>>>> On 18/02/2021 16:38, Rob Herring wrote:  
> > > >>>>>> On Thu, Feb 18, 2021 at 10:15 AM Steven Price <steven.price@arm.com> wrote:  
> > > >>>>>>>
> > > >>>>>>> On 18/02/2021 15:45, Alyssa Rosenzweig wrote:  
> > > >>>>>>>>> Yeah plus Cc: stable for backporting and I think an igt or similar for
> > > >>>>>>>>> panfrost to check this works correctly would be pretty good too. Since
> > > >>>>>>>>> if it took us over 1 year to notice this bug it's pretty clear that
> > > >>>>>>>>> normal testing doesn't catch this. So very likely we'll break this
> > > >>>>>>>>> again.  
> > > >>>>>>>>
> > > >>>>>>>> Unfortunately there are a lot of kernel bugs which are noticed during actual
> > > >>>>>>>> use (but not CI runs), some of which have never been fixed. I do know
> > > >>>>>>>> the shrinker impl is buggy for us, if this is the fix I'm very happy.  
> > > >>>>>>>
> > > >>>>>>> I doubt this will actually "fix" anything - if I understand correctly
> > > >>>>>>> then the sequence which is broken is:
> > > >>>>>>>
> > > >>>>>>>      * allocate BO, mmap to CPU
> > > >>>>>>>      * madvise(DONTNEED)
> > > >>>>>>>      * trigger purge
> > > >>>>>>>      * try to access the BO memory
> > > >>>>>>>
> > > >>>>>>> which is an invalid sequence for user space - the attempt to access
> > > >>>>>>> memory should cause a SIGSEGV. However because drm_vma_node_unmap() is
> > > >>>>>>> unable to find the mappings there may still be page table entries
> > > >>>>>>> present which would provide access to memory the kernel has freed. Which
> > > >>>>>>> is of course a big security hole and so this fix is needed.
> > > >>>>>>>
> > > >>>>>>> In what way do you find the shrinker impl buggy? I'm aware there's some
> > > >>>>>>> dodgy locking (although I haven't worked out how to fix it) - but AFAICT
> > > >>>>>>> it's more deadlock territory rather than lacking in locks. Are there
> > > >>>>>>> correctness issues?  
> > > >>>>>>
> > > >>>>>> What's there was largely a result of getting lockdep happy.
> > > >>>>>>  
> > > >>>>>>>>> btw for testing shrinkers recommended way is to have a debugfs file
> > > >>>>>>>>> that just force-shrinks everything. That way you avoid all the trouble
> > > >>>>>>>>> that tend to happen when you drive a system close to OOM on linux, and
> > > >>>>>>>>> it's also much faster.  
> > > >>>>>>>>
> > > >>>>>>>> 2nding this as a good idea.
> > > >>>>>>>>  
> > > >>>>>>>
> > > >>>>>>> Sounds like a good idea to me too. But equally I'm wondering whether the
> > > >>>>>>> best (short term) solution is to actually disable the shrinker. I'm
> > > >>>>>>> somewhat surprised that nobody has got fed up with the "Purging xxx
> > > >>>>>>> bytes" message spam - which makes me think that most people are not
> > > >>>>>>> hitting memory pressure to trigger the shrinker.  
> > > >>>>>>
> > > >>>>>> If the shrinker is dodgy, then it's probably good to have the messages
> > > >>>>>> to know if it ran.
> > > >>>>>>  
> > > >>>>>>> The shrinker on kbase caused a lot of grief - and the only way I managed
> > > >>>>>>> to get that under control was by writing a static analysis tool for the
> > > >>>>>>> locking, and by upsetting people by enforcing the (rather dumb) rules of
> > > >>>>>>> the tool on the code base. I've been meaning to look at whether sparse
> > > >>>>>>> can do a similar check of locks.  
> > > >>>>>>
> > > >>>>>> Lockdep doesn't cover it?  
> > > >>>>>
> > > >>>>> Short answer: no ;)  
> > > >>>
> > > >>> It's pretty good actually, if you correctly annotate things up.  
> > > >>
> > > >> I agree - it's pretty good, the problem is you need reasonable test
> > > >> coverage, and getting good test coverage of shrinkers is hard.
> > > >>  
> > > >>>>> The problem with lockdep is that you have to trigger the locking
> > > >>>>> scenario to get a warning out of it. For example you obviously won't get
> > > >>>>> any warnings about the shrinker without triggering the shrinker (which
> > > >>>>> means memory pressure since we don't have the debugfs file to trigger it).  
> > > >>>>
> > > >>>> Actually, you don't need debugfs. Writing to /proc/sys/vm/drop_caches
> > > >>>> will do it. Though maybe there's other code path scenarios that
> > > >>>> wouldn't cover.  
> > > >>>
> > > >>> Huh didn't know, but it's a bit a shotgun, plus it doesn't use
> > > >>> fs_reclaim shrinker annotations, which means you don't have lockdep
> > > >>> checks. I think at least, would need some deadlock and testing.  
> > > >>
> > > >> The big problem with this sort of method for triggering the shrinkers is
> > > >> that they are called without (many) locks held. Whereas it's entirely
> > > >> possible for a shrinker to be called at (almost) any allocation in the
> > > >> kernel.
> > > >>
> > > >> Admittedly the Panfrost shrinkers are fairly safe - because most things are
> > > >> xxx_trylock(). kbase avoids trylock which makes reclaim more reliable, but
> > > >> means deadlocks are much easier.  
> > > >
> > > > This is why you need the fs_reclaim annotation. With that lockdep can
> > > > connect the dots. See also might_alloc() annotations I've added in 5.11 or
> > > > so.
> > > >
> > > > Validating shrinkers for deadlocks is actually not that hard, you just
> > > > need the debugfs interface to run your shrinker at will under the
> > > > fs_reclaim_acquire/release annotations. You do _not_ need to hit the full
> > > > combinatorial test matrix of making sure that your shrinker is called in
> > > > any possible place where memory is allocated.  
> > >
> > > Cool - I hadn't looked at that code before, but it does look like it
> > > should pick up the problem cases. I wish that had existed back when I
> > > was dealing with kbase! :)
> > >  
> > > >>>>> I have to admit I'm not 100% sure I've seen any lockdep warnings based
> > > >>>>> on buffer objects recently. I can trigger them based on jobs:
> > > >>>>>  
> > > >> [snip]  
> > > >>>>>
> > > >>>>> Certainly here the mutex causing the problem is the shrinker_lock!
> > > >>>>>
> > > >>>>> The above is triggered by chucking a whole ton of jobs which
> > > >>>>> fault at the GPU.
> > > >>>>>
> > > >>>>> Sadly I haven't found time to work out how to untangle the locks.  
> > > >>>>
> > > >>>> They are tricky because pretty much any memory allocation can trigger
> > > >>>> things as I recall.  
> > > >>>
> > > >>> The above should only be possible with my dma_fence annotations, and
> > > >>> yes the point to bugs in the drm/scheduler. They shouldn't matter for
> > > >>> panfrost, and those patches aren't in upstream yet.  
> > > >>
> > > >> Yes that's on a (random version of) drm-misc - just what I happened to have
> > > >> built recently. Good news if that's not actually Panfrost's bug. I haven't
> > > >> had the time to track down what's going on yet.  
> > > >
> > > > Are you sure this is really drm-misc? The patch should never have been
> > > > there which adds these annotations.  
> > >
> > > Well drm-misc-next. It's based on commit e4abd7ad2b77 with some pending
> > > Panfrost fixes on top.
> > >  
> > > > Also help for fixing common code is appreciated :-)  
> > >
> > > Indeed - I have put it on my todo list, but I'm not sure when I'll be
> > > able to spend time on it.  
> > 
> > Oh there's indeed panfrost annotations in panfrost_reset(), but I
> > guess that stuff just wasn't ever really tested when it landed?
> > 
> > commit 5bc5cc2819c2c0adb644919e3e790b504ea47e0a
> > Author: Boris Brezillon <boris.brezillon@collabora.com>
> > Date:   Thu Nov 5 16:17:04 2020 +0100
> > 
> >    drm/panfrost: Move the GPU reset bits outside the timeout handler
> > 
> > Adding Boris, since my idea wasn't to just add more lockdep splats to
> > panfrost, but help make sure the reset code works correctly.
> 
> Okay, I'm a bit tired of this "oh, this is not really what I asked"
> reply I got when you realize the code you acked/reviewed is not what
> you expected. The dmafence annotation stuff is far from trivial, and I
> remember asking for help when I submitted the version adding those
> annotations in the reset path. Note that I didn't intend to spend days
> on that stuff since all I was trying to do with this patch was to fix a
> deadlock we had in the reset path, deadlock that was not related to dma
> fence signaling at all. So yes, it was not thoroughly tested
> (especially the fence signaling stuff), but enough to validate the
> workloads that triggered the deadlock I was trying to fix no longer
> caused this deadlock.
> 
> Don't get me wrong, I'm all for improving the DRM subsystem, and this
> fencing annotation is probably a great thing, but you can't expect
> people to take on those complex tasks without a bit of supervision.

The annotations are correct afaik, that's still the case.

The idea behind them is to catch bugs that are very hard to spot and
understand, and I'm happy to help with figuring out what needs to be
shuffled around.

What I'm kinda surprised about is that there's not enough
CI/testing/whatever to actually hit all the paths, since from looking at
the splat Steven has, as long as you hit both the shrinker and the reset
path it should splat.

The other thing I'm wondering is why no one brought this up as a
regression/issue. My goal here was to help catch bugs, not make lockdep
less useful (since after the first splat it's gone and useless) - that's
why I made my comment.
-Daniel
Daniel Vetter Feb. 22, 2021, 4:21 p.m. UTC | #17
On Mon, Feb 22, 2021 at 03:24:17PM +0100, Thomas Zimmermann wrote:
> Hi
> 
> Am 17.02.21 um 17:59 schrieb Neil Roberts:
> > When mmapping the shmem, it would previously adjust the pgoff in the
> > vm_area_struct to remove the fake offset that is added to be able to
> > identify the buffer. This patch removes the adjustment and makes the
> > fault handler use the vm_fault address to calculate the page offset
> > instead. Although using this address is apparently discouraged, several
> > DRM drivers seem to be doing it anyway.
> > 
> > The problem with removing the pgoff is that it prevents
> > drm_vma_node_unmap from working because that searches the mapping tree
> > by address. That doesn't work because all of the mappings are at offset
> > 0. drm_vma_node_unmap is being used by the shmem helpers when purging
> > the buffer.
> 
> I just want to ask if this is how the mmap callbacks are supposed to work
> now?
> 
> I have a number of patches in here that convert several drivers to the GEM
> object's mmap callback. They might not be affected by the vm_pgoff bug, but
> I'd like to submit something that could work in the longer term.

Yeah, we pretty much require the unique vm_pgoff for runtime unmapping.
Especially with more dynamic memory managers like ttm that move buffers
around - for more static ones (most of the armsoc ones) it's just a bit
of a security issue, since you can potentially access memory after it's
gone.
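
For reference, drm_vma_node_unmap() is roughly this (from
include/drm/drm_vma_manager.h):

static inline void drm_vma_node_unmap(struct drm_vma_offset_node *node,
                                      struct address_space *file_mapping)
{
        if (drm_vma_node_has_offset(node))
                unmap_mapping_range(file_mapping,
                                    drm_vma_node_offset_addr(node),
                                    drm_vma_node_size(node) << PAGE_SHIFT, 1);
}

unmap_mapping_range() looks the VMAs up by vm_pgoff, which is why the
fake offset has to stay in there.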
-Daniel

> Best regards
> Thomas
> 
> > 
> > It looks like panfrost is using drm_gem_shmem_purge so this might fix a
> > potential bug there.
> > 
> > Signed-off-by: Neil Roberts <nroberts@igalia.com>
> > ---
> >   drivers/gpu/drm/drm_gem_shmem_helper.c | 12 +++++++-----
> >   1 file changed, 7 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c b/drivers/gpu/drm/drm_gem_shmem_helper.c
> > index 9825c378dfa6..4b14157f1962 100644
> > --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> > +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> > @@ -526,11 +526,16 @@ static vm_fault_t drm_gem_shmem_fault(struct vm_fault *vmf)
> >   	struct drm_gem_shmem_object *shmem = to_drm_gem_shmem_obj(obj);
> >   	loff_t num_pages = obj->size >> PAGE_SHIFT;
> >   	struct page *page;
> > +	pgoff_t page_offset;
> > -	if (vmf->pgoff >= num_pages || WARN_ON_ONCE(!shmem->pages))
> > +	/* We don't use vmf->pgoff since that has the fake offset */
> > +	page_offset = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> > +
> > +	if (page_offset < 0 || page_offset >= num_pages ||
> > +	    WARN_ON_ONCE(!shmem->pages))
> >   		return VM_FAULT_SIGBUS;
> > -	page = shmem->pages[vmf->pgoff];
> > +	page = shmem->pages[page_offset];
> >   	return vmf_insert_page(vma, vmf->address, page);
> >   }
> > @@ -581,9 +586,6 @@ int drm_gem_shmem_mmap(struct drm_gem_object *obj, struct vm_area_struct *vma)
> >   	struct drm_gem_shmem_object *shmem;
> >   	int ret;
> > -	/* Remove the fake offset */
> > -	vma->vm_pgoff -= drm_vma_node_start(&obj->vma_node);
> > -
> >   	if (obj->import_attach) {
> >   		/* Drop the reference drm_gem_mmap_obj() acquired.*/
> >   		drm_gem_object_put(obj);
> > 
> 
> -- 
> Thomas Zimmermann
> Graphics Driver Developer
> SUSE Software Solutions Germany GmbH
> Maxfeldstr. 5, 90409 Nürnberg, Germany
> (HRB 36809, AG Nürnberg)
> Geschäftsführer: Felix Imendörffer
> 




> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
Neil Roberts Feb. 23, 2021, 12:42 p.m. UTC | #18
Daniel Vetter <daniel@ffwll.ch> writes:

> Yeah plus Cc: stable for backporting and I think an igt or similar for
> panfrost to check this works correctly would be pretty good too. Since
> if it took us over 1 year to notice this bug it's pretty clear that
> normal testing doesn't catch this. So very likely we'll break this
> again.

I made the IGT test below which seems to reproduce the bug. However, the
kernel patch doesn’t fix it, so maybe there is something more subtle
going on.

https://gitlab.freedesktop.org/nroberts/igt-gpu-tools/-/commits/panfrost-purgemap/

Regards,
- Neil
Daniel Vetter Feb. 23, 2021, 12:54 p.m. UTC | #19
On Tue, Feb 23, 2021 at 1:42 PM Neil Roberts <nroberts@igalia.com> wrote:
>
> Daniel Vetter <daniel@ffwll.ch> writes:
>
> > Yeah plus Cc: stable for backporting and I think an igt or similar for
> > panfrost to check this works correctly would be pretty good too. Since
> > if it took us over 1 year to notice this bug it's pretty clear that
> > normal testing doesn't catch this. So very likely we'll break this
> > again.
>
> I made the IGT test below which seems to reproduce the bug. However, the
> kernel patch doesn’t fix it, so maybe there is something more subtle
> going on.
>
> https://gitlab.freedesktop.org/nroberts/igt-gpu-tools/-/commits/panfrost-purgemap/

drm_gem_shmem_fault() does not seem to check for purged objects at all.

No idea how this works, or if it ever worked, but yeah something is
clearly still busted.
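
Something along these lines would plug that hole (rough, untested
sketch on top of this patch - takes pages_lock around the lookup and
treats a purged object, i.e. shmem->madv < 0, as SIGBUS):

static vm_fault_t drm_gem_shmem_fault(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        struct drm_gem_object *obj = vma->vm_private_data;
        struct drm_gem_shmem_object *shmem = to_drm_gem_shmem_obj(obj);
        loff_t num_pages = obj->size >> PAGE_SHIFT;
        vm_fault_t ret;
        pgoff_t page_offset;

        /* We don't use vmf->pgoff since that has the fake offset */
        page_offset = (vmf->address - vma->vm_start) >> PAGE_SHIFT;

        mutex_lock(&shmem->pages_lock);

        if (page_offset >= num_pages ||
            WARN_ON_ONCE(!shmem->pages) ||
            shmem->madv < 0)
                ret = VM_FAULT_SIGBUS;
        else
                ret = vmf_insert_page(vma, vmf->address,
                                      shmem->pages[page_offset]);

        mutex_unlock(&shmem->pages_lock);

        return ret;
}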

Definitely a good idea to have an igt. btw, to make that faster you can
either write to /proc/sys/vm/drop_caches (it's a bit of a hammer), or,
what I recommend: have a dedicated debugfs file that only drops
everything from your shrinker. That's much quicker and more controlled.
See e.g. ttm_tt_debugfs_shrink from d4bd7776a7ac ("drm/ttm: rework
ttm_tt page limit v4"), which recently landed in drm-misc-next.
-Daniel
Neil Roberts Feb. 23, 2021, 3:59 p.m. UTC | #20
Daniel Vetter <daniel@ffwll.ch> writes:

> drm_gem_shmem_fault() does not seem to check for purged objects at all.
>
> No idea how this works, or if it ever worked, but yeah something is
> clearly still busted.

Oh of course, the fault handler doesn’t check this. I’ve added a second
patch to make it check and posted it as a separate series here:

https://lists.freedesktop.org/archives/dri-devel/2021-February/298170.html

The two patches combined make the IGT test pass.

> Definitely a good idae to have an igt. btw to make that faster you can
> either use the vm_drop_caches file from proc (it's a bit a hammer), or
> what I recommend: Have a dedicated debugfs file to only drop
> everything from your shrinker. That's much quicker and  controlled.
> See e.g. ttm_tt_debugfs_shrink from d4bd7776a7ac ("drm/ttm: rework
> ttm_tt page limit v4") which recently landed in drm-misc-next.

I agree it would be great to have a debugfs option to trigger the purge.
I wonder if someone more involved in Panfrost would like to implement
this, because I am actually trying to work on VC4 and this is already
turning out to be quite a lot of yak shaving :) I’d also like to
implement the same debugfs option and IGT test for VC4.

Thanks for the feedback.

Regards,
- Neil
Daniel Vetter Feb. 23, 2021, 4:40 p.m. UTC | #21
On Tue, Feb 23, 2021 at 4:59 PM Neil Roberts <nroberts@igalia.com> wrote:
>
> Daniel Vetter <daniel@ffwll.ch> writes:
>
> > drm_gem_shmem_fault() does not seem to check for purged objects at all.
> >
> > No idea how this works, or if it ever worked, but yeah something is
> > clearly still busted.
>
> Oh of course, the fault handler doesn’t check this. I’ve added a second
> patch to make it check and posted it as a separate series here:
>
> https://lists.freedesktop.org/archives/dri-devel/2021-February/298170.html
>
> The two patches combined make the IGT test pass.
>
> > Definitely a good idae to have an igt. btw to make that faster you can
> > either use the vm_drop_caches file from proc (it's a bit a hammer), or
> > what I recommend: Have a dedicated debugfs file to only drop
> > everything from your shrinker. That's much quicker and  controlled.
> > See e.g. ttm_tt_debugfs_shrink from d4bd7776a7ac ("drm/ttm: rework
> > ttm_tt page limit v4") which recently landed in drm-misc-next.
>
> I agree it would be great to have a debugfs option to trigger the purge.
> I wonder if someone more involved in Panfrost would like to implement
> this, because I am actually trying to work on VC4 and this is already
> turning out to be quite a lot of yak shaving :) I’d also like to
> implement the same debugfs option and IGT test for VC4.

If we push the shrinker setup into the helpers (this means we
minimally need an LRU, and probably more reasonable locking than the
shmem helpers use right now), then we could have one debugfs file for
all drivers supporting purgeable objects. We could then even share some
of the igt; only the ioctl code would need to be driver specific.

It's a bit of work though.
-Daniel

> Thanks for the feedback.
>
> Regards,
> - Neil
diff mbox series

Patch

diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c b/drivers/gpu/drm/drm_gem_shmem_helper.c
index 9825c378dfa6..4b14157f1962 100644
--- a/drivers/gpu/drm/drm_gem_shmem_helper.c
+++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
@@ -526,11 +526,16 @@  static vm_fault_t drm_gem_shmem_fault(struct vm_fault *vmf)
 	struct drm_gem_shmem_object *shmem = to_drm_gem_shmem_obj(obj);
 	loff_t num_pages = obj->size >> PAGE_SHIFT;
 	struct page *page;
+	pgoff_t page_offset;
 
-	if (vmf->pgoff >= num_pages || WARN_ON_ONCE(!shmem->pages))
+	/* We don't use vmf->pgoff since that has the fake offset */
+	page_offset = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
+
+	if (page_offset < 0 || page_offset >= num_pages ||
+	    WARN_ON_ONCE(!shmem->pages))
 		return VM_FAULT_SIGBUS;
 
-	page = shmem->pages[vmf->pgoff];
+	page = shmem->pages[page_offset];
 
 	return vmf_insert_page(vma, vmf->address, page);
 }
@@ -581,9 +586,6 @@  int drm_gem_shmem_mmap(struct drm_gem_object *obj, struct vm_area_struct *vma)
 	struct drm_gem_shmem_object *shmem;
 	int ret;
 
-	/* Remove the fake offset */
-	vma->vm_pgoff -= drm_vma_node_start(&obj->vma_node);
-
 	if (obj->import_attach) {
 		/* Drop the reference drm_gem_mmap_obj() acquired.*/
 		drm_gem_object_put(obj);