[2/5] kernel.h: Add non_block_start/end()

Message ID: 20190814202027.18735-3-daniel.vetter@ffwll.ch (mailing list archive)
State: New, archived
Series: hmm & mmu_notifier debug/lockdep annotations

Commit Message

Daniel Vetter Aug. 14, 2019, 8:20 p.m. UTC
In some special cases we must not block, but there's not a
spinlock, preempt-off, irqs-off or similar critical section already
that arms the might_sleep() debug checks. Add a non_block_start/end()
pair to annotate these.

This will be used in the oom paths of mmu-notifiers, where blocking is
not allowed to make sure there's forward progress. Quoting Michal:

"The notifier is called from quite a restricted context - oom_reaper -
which shouldn't depend on any locks or sleepable conditionals. The code
should be swift as well but we mostly do care about it to make a forward
progress. Checking for sleepable context is the best thing we could come
up with that would describe these demands at least partially."

Peter also asked whether we want to catch spinlocks on top, but Michal
said those are less of a problem because spinlocks can't have an
indirect dependency upon the page allocator and hence close the loop
with the oom reaper.

Suggested by Michal Hocko.

v2:
- Improve commit message (Michal)
- Also check in schedule, not just might_sleep (Peter)

v3: It works better when I actually squash in the fixup I had lying
around :-/

Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: linux-mm@kvack.org
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Wei Wang <wvw@google.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-kernel@vger.kernel.org
Acked-by: Christian König <christian.koenig@amd.com> (v1)
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 include/linux/kernel.h | 10 +++++++++-
 include/linux/sched.h  |  4 ++++
 kernel/sched/core.c    | 19 ++++++++++++++-----
 3 files changed, 27 insertions(+), 6 deletions(-)
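
A minimal sketch of the mechanism the diffstat implies - a per-task
nesting counter that the might_sleep() machinery checks. Treat this as
an approximation of the patch for orientation, not the literal diff:

  /* include/linux/sched.h (sketch): nesting counter in task_struct */
  struct task_struct {
          /* ... */
          int                     non_block_count;
          /* ... */
  };

  /* include/linux/kernel.h (sketch): the annotation pair */
  # define non_block_start() (current->non_block_count++)
  # define non_block_end()   (WARN_ON(current->non_block_count-- == 0))

  /* kernel/sched/core.c (sketch): ___might_sleep() additionally
   * fires when current->non_block_count != 0, and schedule() itself
   * warns if a real context switch happens inside such a section
   * (the "also check in schedule" change from v2). */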

Comments

Andrew Morton Aug. 14, 2019, 8:45 p.m. UTC | #1
On Wed, 14 Aug 2019 22:20:24 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote:

> In some special cases we must not block, but there's not a
> spinlock, preempt-off, irqs-off or similar critical section already
> that arms the might_sleep() debug checks. Add a non_block_start/end()
> pair to annotate these.
> 
> This will be used in the oom paths of mmu-notifiers, where blocking is
> not allowed to make sure there's forward progress. Quoting Michal:
> 
> "The notifier is called from quite a restricted context - oom_reaper -
> which shouldn't depend on any locks or sleepable conditionals. The code
> should be swift as well but we mostly do care about it to make a forward
> progress. Checking for sleepable context is the best thing we could come
> up with that would describe these demands at least partially."
> 
> Peter also asked whether we want to catch spinlocks on top, but Michal
> said those are less of a problem because spinlocks can't have an
> indirect dependency upon the page allocator and hence close the loop
> with the oom reaper.

I continue to struggle with this.  It introduces a new kernel state
"running preemptibly but must not call schedule()".  How does this make
any sense?

Perhaps a much, much more detailed description of the oom_reaper
situation would help out.
Jason Gunthorpe Aug. 14, 2019, 11:58 p.m. UTC | #2
On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote:
> In some special cases we must not block, but there's not a
> spinlock, preempt-off, irqs-off or similar critical section already
> that arms the might_sleep() debug checks. Add a non_block_start/end()
> pair to annotate these.
> 
> This will be used in the oom paths of mmu-notifiers, where blocking is
> not allowed to make sure there's forward progress. Quoting Michal:
> 
> "The notifier is called from quite a restricted context - oom_reaper -
> which shouldn't depend on any locks or sleepable conditionals. The code
> should be swift as well but we mostly do care about it to make a forward
> progress. Checking for sleepable context is the best thing we could come
> up with that would describe these demands at least partially."

But this describes fs_reclaim_acquire() - is there some reason we are
conflating fs_reclaim with non-sleeping?

ie is there some fundamental difference between the block stack
sleeping during reclaim while it waits for a driver to write out a
page and a GPU driver sleeping during OOM while it waits for its HW
to fence DMA on a page?

Fundamentally we have invalidate_range_start() vs invalidate_range()
as the start() version is able to sleep. If drivers can do their work
without sleeping then they should be using invalidate_range() instead.

Thus, it doesn't seem to make any sense to ask a driver that requires a
sleeping API not to sleep.

AFAICT what is really going on here is that drivers care about only a
subset of the VA space, and we want to query the driver if it cares
about the range proposed to be OOM'd, so we can OOM ranges that
do not have SPTEs.

ie if you look pretty much all drivers do exactly as
userptr_mn_invalidate_range_start() does, and bail once they detect
the VA range is of interest.

So, I'm working on a patch to lift the interval tree into the notifier
core and then do the VA test OOM needs without bothering the
driver. Drivers can retain the blocking API they require and OOM can
work on VA's that don't have SPTEs.

This approach also solves the critical bug in this path:
  https://lore.kernel.org/linux-mm/20190807191627.GA3008@ziepe.ca/

And solves a bunch of other bugs in the drivers.

> Peter also asked whether we want to catch spinlocks on top, but Michal
> said those are less of a problem because spinlocks can't have an
> indirect dependency upon the page allocator and hence close the loop
> with the oom reaper.

Again, this entirely sounds like fs_reclaim - isn't that exactly what
it is for?

I have had on my list a second and very related possible bug. I ran
into commit 35cfa2b0b491 ("mm/mmu_notifier: allocate mmu_notifier in
advance") which says that mapping->i_mmap_mutex is under fs_reclaim().

We do hold i_mmap_rwsem while calling invalidate_range_start():

 unmap_mapping_pages
  i_mmap_lock_write(mapping); // ie i_mmap_rwsem
  unmap_mapping_range_tree
   unmap_mapping_range_vma
    zap_page_range_single
     mmu_notifier_invalidate_range_start

So, if it is still true that i_mmap_rwsem is under fs_reclaim then
invalidate_range_start is *always* under fs_reclaim anyhow! (this I do
not know)

Thus we should use lockdep to force this and fix all the drivers.

.. and if we force fs_reclaim always, do we care about blockable
anymore??

Jason
Daniel Vetter Aug. 15, 2019, 6:52 a.m. UTC | #3
On Wed, Aug 14, 2019 at 01:45:58PM -0700, Andrew Morton wrote:
> On Wed, 14 Aug 2019 22:20:24 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> 
> > In some special cases we must not block, but there's not a
> > spinlock, preempt-off, irqs-off or similar critical section already
> > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > pair to annotate these.
> > 
> > This will be used in the oom paths of mmu-notifiers, where blocking is
> > not allowed to make sure there's forward progress. Quoting Michal:
> > 
> > "The notifier is called from quite a restricted context - oom_reaper -
> > which shouldn't depend on any locks or sleepable conditionals. The code
> > should be swift as well but we mostly do care about it to make a forward
> > progress. Checking for sleepable context is the best thing we could come
> > up with that would describe these demands at least partially."
> > 
> > Peter also asked whether we want to catch spinlocks on top, but Michal
> > said those are less of a problem because spinlocks can't have an
> > indirect dependency upon the page allocator and hence close the loop
> > with the oom reaper.
> 
> I continue to struggle with this.  It introduces a new kernel state
> "running preemptibly but must not call schedule()".  How does this make
> any sense?
> 
> Perhaps a much, much more detailed description of the oom_reaper
> situation would help out.

I agree on both points, but I guess I'm not the expert to explain why we
have this. All I'm trying to do is make sure that drivers hold up their
side. If you want better documentation for why the oom case needs this
special new mode, you're looking at the wrong guy for that :-)

Of course if you folks all decide that you just don't want to be
reminded of that, I guess we can drop this one here, but you're just
shooting the messenger I think.
-Daniel
Daniel Vetter Aug. 15, 2019, 6:58 a.m. UTC | #4
On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote:
> > In some special cases we must not block, but there's not a
> > spinlock, preempt-off, irqs-off or similar critical section already
> > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > pair to annotate these.
> > 
> > This will be used in the oom paths of mmu-notifiers, where blocking is
> > not allowed to make sure there's forward progress. Quoting Michal:
> > 
> > "The notifier is called from quite a restricted context - oom_reaper -
> > which shouldn't depend on any locks or sleepable conditionals. The code
> > should be swift as well but we mostly do care about it to make a forward
> > progress. Checking for sleepable context is the best thing we could come
> > up with that would describe these demands at least partially."
> 
> But this describes fs_reclaim_acquire() - is there some reason we are
> conflating fs_reclaim with non-sleeping?

No idea why you tie this into fs_reclaim. We can definitely sleep in there,
and for e.g. kswapd (which also wraps everything in fs_reclaim) we're
even supposed to, I thought. To make sure we can get at the last bit of
memory by flushing all the queues and waiting for everything to be cleaned
out.

> ie is there some fundamental difference between the block stack
> sleeping during reclaim while it waits for a driver to write out a
> page and a GPU driver sleeping during OOM while it waits for its HW
> to fence DMA on a page?
> 
> Fundamentally we have invalidate_range_start() vs invalidate_range()
> as the start() version is able to sleep. If drivers can do their work
> without sleeping then they should be using invalidate_range() instead.
> 
> Thus, it doesn't seem to make any sense to ask a driver that requires a
> sleeping API not to sleep.
> 
> AFAICT what is really going on here is that drivers care about only a
> subset of the VA space, and we want to query the driver if it cares
> about the range proposed to be OOM'd, so we can OOM ranges that
> do not have SPTEs.
> 
> ie if you look pretty much all drivers do exactly as
> userptr_mn_invalidate_range_start() does, and bail once they detect
> the VA range is of interest.
> 
> So, I'm working on a patch to lift the interval tree into the notifier
> core and then do the VA test OOM needs without bothering the
> driver. Drivers can retain the blocking API they require and OOM can
> work on VA's that don't have SPTEs.

Hm, I figured the point of hmm_mirror is to have that interval tree for
everyone (among other things). But yeah, lifting it to mmu_notifier sounds
like a clean solution for this - though I really don't have much clue about
why we even have this special mode in the oom case. I'm just trying to
increase the odds that drivers hold up their end of the bargain.

> This approach also solves the critical bug in this path:
>   https://lore.kernel.org/linux-mm/20190807191627.GA3008@ziepe.ca/
> 
> And solves a bunch of other bugs in the drivers.
> 
> > Peter also asked whether we want to catch spinlocks on top, but Michal
> > said those are less of a problem because spinlocks can't have an
> > indirect dependency upon the page allocator and hence close the loop
> > with the oom reaper.
> 
> Again, this entirely sounds like fs_reclaim - isn't that exactly what
> it is for?
> 
> I have had on my list a second and very related possible bug. I ran
> into commit 35cfa2b0b491 ("mm/mmu_notifier: allocate mmu_notifier in
> advance") which says that mapping->i_mmap_mutex is under fs_reclaim().
> 
> We do hold i_mmap_rwsem while calling invalidate_range_start():
> 
>  unmap_mapping_pages
>   i_mmap_lock_write(mapping); // ie i_mmap_rwsem
>   unmap_mapping_range_tree
>    unmap_mapping_range_vma
>     zap_page_range_single
>      mmu_notifier_invalidate_range_start
> 
> So, if it is still true that i_mmap_rwsem is under fs_reclaim then
> invalidate_range_start is *always* under fs_reclaim anyhow! (this I do
> not know)
> 
> Thus we should use lockdep to force this and fix all the drivers.
> 
> .. and if we force fs_reclaim always, do we care about blockable
> anymore??

Still not sure why fs_reclaim matters here. One of the later patches
steals the fs_reclaim idea for mmu_notifiers, but that ties together
completely different paths.
-Daniel
Michal Hocko Aug. 15, 2019, 8:44 a.m. UTC | #5
On Wed 14-08-19 13:45:58, Andrew Morton wrote:
> On Wed, 14 Aug 2019 22:20:24 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> 
> > In some special cases we must not block, but there's not a
> > spinlock, preempt-off, irqs-off or similar critical section already
> > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > pair to annotate these.
> > 
> > This will be used in the oom paths of mmu-notifiers, where blocking is
> > not allowed to make sure there's forward progress. Quoting Michal:
> > 
> > "The notifier is called from quite a restricted context - oom_reaper -
> > which shouldn't depend on any locks or sleepable conditionals. The code
> > should be swift as well but we mostly do care about it to make a forward
> > progress. Checking for sleepable context is the best thing we could come
> > up with that would describe these demands at least partially."
> > 
> > Peter also asked whether we want to catch spinlocks on top, but Michal
> > said those are less of a problem because spinlocks can't have an
> > indirect dependency upon the page allocator and hence close the loop
> > with the oom reaper.
> 
> I continue to struggle with this.  It introduces a new kernel state
> "running preemptibly but must not call schedule()".  How does this make
> any sense?
> 
> Perhaps a much, much more detailed description of the oom_reaper
> situation would help out.
 
The primary point here is that there is a demand for non-blockable mmu
notifiers to be called when the oom reaper tears down the address space.
As the oom reaper is the primary guarantee of the oom handling forward
progress it cannot be blocked on anything that might depend on blockable
memory allocations. These are not really easy to track because they
might be indirect - e.g. notifier blocks on a lock which other context
holds while allocating memory or waiting for a flusher that needs memory
to perform its work. If such a blocking state happens, we can end up
in a silent hang with an unusable machine.
Now we hope for reasonable implementations of mmu notifiers (strong
words I know ;) and this should be a relatively simple and effective
catch-all tool to detect that something suspicious is going on.

Does that make the situation more clear?
Jason Gunthorpe Aug. 15, 2019, 12:23 p.m. UTC | #6
On Thu, Aug 15, 2019 at 08:58:29AM +0200, Daniel Vetter wrote:
> On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote:
> > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote:
> > > In some special cases we must not block, but there's not a
> > > spinlock, preempt-off, irqs-off or similar critical section already
> > > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > > pair to annotate these.
> > > 
> > > This will be used in the oom paths of mmu-notifiers, where blocking is
> > > not allowed to make sure there's forward progress. Quoting Michal:
> > > 
> > > "The notifier is called from quite a restricted context - oom_reaper -
> > > which shouldn't depend on any locks or sleepable conditionals. The code
> > > should be swift as well but we mostly do care about it to make a forward
> > > progress. Checking for sleepable context is the best thing we could come
> > > up with that would describe these demands at least partially."
> > 
> > But this describes fs_reclaim_acquire() - is there some reason we are
> > conflating fs_reclaim with non-sleeping?
> 
> No idea why you tie this into fs_reclaim. We can definitely sleep in there,
> and for e.g. kswapd (which also wraps everything in fs_reclaim) we're
> even supposed to, I thought. To make sure we can get at the last bit of
> memory by flushing all the queues and waiting for everything to be cleaned
> out.

AFAIK the point of fs_reclaim is to prevent "indirect dependency upon
the page allocator" ie the justification that was given for this
!blockable stuff.

For instance:

  fs_reclaim_acquire()
  kmalloc(GFP_KERNEL) <- lock dep assertion

And further, Michal's concern about indirectness through locks is also
handled by lockdep:

       CPU0                                 CPU1
                                        mutex_lock()
                                        kmalloc(GFP_KERNEL)
                                        mutex_unlock()
  fs_reclaim_acquire()
  mutex_lock() <- lock dep assertion

In other words, to prevent recursion into the page allocator you use
fs_reclaim_acquire(), and lockdep verifies it in its usual robust way.
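
To make that concrete, a hedged driver-side sketch
(my_notifier_path()/do_the_work() are made-up names;
fs_reclaim_acquire()/fs_reclaim_release() are the real API from
linux/sched/mm.h):

  static void my_notifier_path(void)      /* hypothetical */
  {
          /* tell lockdep we are, in effect, inside reclaim */
          fs_reclaim_acquire(GFP_KERNEL);

          /* any kmalloc(GFP_KERNEL) in here - or any lock taken in
           * here that another context holds around such an
           * allocation - now produces a lockdep report instead of a
           * rare, hard-to-reproduce deadlock */
          do_the_work();                  /* hypothetical */

          fs_reclaim_release(GFP_KERNEL);
  }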

I asked Tejun about this once in regards to WQ_MEM_RECLAIM and he
explained that it means you can't call the allocator functions in a
way that would recurse into reclaim (ie instead use GFP_ATOMIC, or
tolerate allocation failure, or various other things).

So, the reason I bring it up is half the justifications you posted for
blockable had to do with not recursing into reclaim and deadlocking,
and didn't seem to have much to do with blocking.

I'm asking if *non-blocking* is really the requirement or if this is
just the usual 'do not deadlock on the allocator' thing reclaim paths
already have?

Jason
Jason Gunthorpe Aug. 15, 2019, 1:04 p.m. UTC | #7
On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote:

> As the oom reaper is the primary guarantee of the oom handling forward
> progress it cannot be blocked on anything that might depend on blockable
> memory allocations. These are not really easy to track because they
> might be indirect - e.g. notifier blocks on a lock which other context
> holds while allocating memory or waiting for a flusher that needs memory
> to perform its work.

But lockdep *does* track all this and fs_reclaim_acquire() was created
to solve exactly this problem.

fs_reclaim is a lock and it flows through all the usual lockdep
schemes like any other lock. Any time the page allocator wants to do
something that would deadlock with reclaim it takes the lock.

Failure is expressed by a deadlock cycle in the lockdep map, and
lockdep can handle arbitrary complexity through layers of locks, work
queues, threads, etc.

What is missing?

Jason
Daniel Vetter Aug. 15, 2019, 1:12 p.m. UTC | #8
On Thu, Aug 15, 2019 at 3:04 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote:
>
> > As the oom reaper is the primary guarantee of the oom handling forward
> > progress it cannot be blocked on anything that might depend on blockable
> > memory allocations. These are not really easy to track because they
> > might be indirect - e.g. notifier blocks on a lock which other context
> > holds while allocating memory or waiting for a flusher that needs memory
> > to perform its work.
>
> But lockdep *does* track all this and fs_reclaim_acquire() was created
> to solve exactly this problem.
>
> fs_reclaim is a lock and it flows through all the usual lockdep
> schemes like any other lock. Any time the page allocator wants to do
> something that would deadlock with reclaim it takes the lock.
>
> Failure is expressed by a deadlock cycle in the lockdep map, and
> lockdep can handle arbitrary complexity through layers of locks, work
> queues, threads, etc.
>
> What is missing?

Lockdep doesn't see everything by far. E.g. a wait_event will be
caught by the annotations here, but not by lockdep. Plus lockdep does
not see through the wait_event, and doesn't realize if e.g. that event
will never signal because the worker is part of the deadlock loop.
cross-release was supposed to fix that, but seems like that will never
land.

And since we're talking about mmu notifiers here and gpus/dma engines.
We have dma_fence_wait, which can wait for any hw/driver in the system
that takes part in shared/zero-copy buffer processing. Which at least
on the graphics side is everything. This pulls in enormous amounts of
deadlock potential that lockdep simply is blind about and will never
see.

Arming might_sleep catches them all.
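
Concretely, a sketch (my_wq/my_cond are placeholders): wait_event()
calls might_sleep() on entry, so the annotation fires immediately,
while lockdep records nothing here at all:

  non_block_start();
  /* wait_event() starts with might_sleep(), so with the annotation
   * armed this line splats right away ... */
  wait_event(my_wq, my_cond);
  /* ... but lockdep sees no lock acquisition: if the thread that
   * would set my_cond is itself stuck allocating memory, the
   * resulting deadlock never shows up in the lock graph */
  non_block_end();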

Cheers, Daniel

PS: Don't ask me why we need these semantics for oom_reaper; like
I said, I'm just trying to follow the rules.
Michal Hocko Aug. 15, 2019, 1:21 p.m. UTC | #9
On Thu 15-08-19 09:23:44, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 08:58:29AM +0200, Daniel Vetter wrote:
> > On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote:
> > > > In some special cases we must not block, but there's not a
> > > > spinlock, preempt-off, irqs-off or similar critical section already
> > > > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > > > pair to annotate these.
> > > > 
> > > > This will be used in the oom paths of mmu-notifiers, where blocking is
> > > > not allowed to make sure there's forward progress. Quoting Michal:
> > > > 
> > > > "The notifier is called from quite a restricted context - oom_reaper -
> > > > which shouldn't depend on any locks or sleepable conditionals. The code
> > > > should be swift as well but we mostly do care about it to make a forward
> > > > progress. Checking for sleepable context is the best thing we could come
> > > > up with that would describe these demands at least partially."
> > > 
> > > But this describes fs_reclaim_acquire() - is there some reason we are
> > > conflating fs_reclaim with non-sleeping?
> > 
> > No idea why you tie this into fs_reclaim. We can definitely sleep in there,
> > and for e.g. kswapd (which also wraps everything in fs_reclaim) we're
> > even supposed to, I thought. To make sure we can get at the last bit of
> > memory by flushing all the queues and waiting for everything to be cleaned
> > out.
> 
> AFAIK the point of fs_reclaim is to prevent "indirect dependency upon
> the page allocator" ie the justification that was given for this
> !blockable stuff.
> 
> For instance:
> 
>   fs_reclaim_acquire()
>   kmalloc(GFP_KERNEL) <- lock dep assertion
> 
> And further, Michal's concern about indirectness through locks is also
> handled by lockdep:
> 
>        CPU0                                 CPU1
>                                         mutex_lock()
>                                         kmalloc(GFP_KERNEL)
>                                         mutex_unlock()
>   fs_reclaim_acquire()
>   mutex_lock() <- lock dep assertion
> 
> In other words, to prevent recursion into the page allocator you use
> fs_reclaim_acquire(), and lockdep verifies it in its usual robust way.

fs_reclaim_acquire is about FS/IO recursions IIUC. We are talking about
any !GFP_NOWAIT allocation context here and any {in}direct dependency on
it. Whether fs_reclaim_acquire can be reused for that I do not know
because I am not familiar enough with the lockdep machinery.
 
> I asked Tejun about this once in regards to WQ_MEM_RECLAIM and he
> explained that it means you can't call the allocator functions in a
> way that would recurse into reclaim (ie instead use GFP_ATOMIC, or
> tolerate allocation failure, or various other things).
> 
> So, the reason I bring it up is half the justifications you posted for
> blockable had to do with not recursing into reclaim and deadlocking,
> and didn't seem to have much to do with blocking.
> 
> I'm asking if *non-blocking* is really the requirement or if this is
> just the usual 'do not deadlock on the allocator' thing reclaim paths
> already have?

No, non-blocking is a very coarse approximation of what we really need.
But it should give us an even stronger condition. Essentially any sleep
other than a preemption shouldn't be allowed in that context.
Michal Hocko Aug. 15, 2019, 1:24 p.m. UTC | #10
On Thu 15-08-19 10:04:15, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote:
> 
> > As the oom reaper is the primary guarantee of the oom handling forward
> > progress it cannot be blocked on anything that might depend on blockable
> > memory allocations. These are not really easy to track because they
> > might be indirect - e.g. notifier blocks on a lock which other context
> > holds while allocating memory or waiting for a flusher that needs memory
> > to perform its work.
> 
> But lockdep *does* track all this and fs_reclaim_acquire() was created
> to solve exactly this problem.
> 
> fs_reclaim is a lock and it flows through all the usual lockdep
> schemes like any other lock. Any time the page allocator wants to do
> something that would deadlock with reclaim it takes the lock.

Our emails have crossed. Even if fs_reclaim can be re-purposed for other
than FS/IO reclaim contexts, which I am not sure about, I think that
lockdep is too heavyweight for the purpose of this debugging aid in the
first place. The non-block mode is really something as simple as "hey
mmu notifier, you are only allowed to do lightweight processing, if you
cannot guarantee that then just back off". The annotation will give us a
warning if the notifier breaks this promise.
Jason Gunthorpe Aug. 15, 2019, 2:12 p.m. UTC | #11
On Thu, Aug 15, 2019 at 03:21:27PM +0200, Michal Hocko wrote:
> On Thu 15-08-19 09:23:44, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 08:58:29AM +0200, Daniel Vetter wrote:
> > > On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote:
> > > > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote:
> > > > > In some special cases we must not block, but there's not a
> > > > > spinlock, preempt-off, irqs-off or similar critical section already
> > > > > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > > > > pair to annotate these.
> > > > > 
> > > > > This will be used in the oom paths of mmu-notifiers, where blocking is
> > > > > not allowed to make sure there's forward progress. Quoting Michal:
> > > > > 
> > > > > "The notifier is called from quite a restricted context - oom_reaper -
> > > > > which shouldn't depend on any locks or sleepable conditionals. The code
> > > > > should be swift as well but we mostly do care about it to make a forward
> > > > > progress. Checking for sleepable context is the best thing we could come
> > > > > up with that would describe these demands at least partially."
> > > > 
> > > > But this describes fs_reclaim_acquire() - is there some reason we are
> > > > conflating fs_reclaim with non-sleeping?
> > > 
> > > No idea why you tie this into fs_reclaim. We can definitely sleep in there,
> > > and for e.g. kswapd (which also wraps everything in fs_reclaim) we're
> > > even supposed to, I thought. To make sure we can get at the last bit of
> > > memory by flushing all the queues and waiting for everything to be cleaned
> > > out.
> > 
> > AFAIK the point of fs_reclaim is to prevent "indirect dependency upon
> > the page allocator" ie the justification that was given for this
> > !blockable stuff.
> > 
> > For instance:
> > 
> >   fs_reclaim_acquire()
> >   kmalloc(GFP_KERNEL) <- lock dep assertion
> > 
> > And further, Michal's concern about indirectness through locks is also
> > handled by lockdep:
> > 
> >        CPU0                                 CPU1
> >                                         mutex_lock()
> >                                         kmalloc(GFP_KERNEL)
> >                                         mutex_unlock()
> >   fs_reclaim_acquire()
> >   mutex_lock() <- lock dep assertion
> > 
> > In other words, to prevent recursion into the page allocator you use
> > fs_reclaim_acquire(), and lockdep verifies it in its usual robust way.
> 
> fs_reclaim_acquire is about FS/IO recursions IIUC. We are talking about
> any !GFP_NOWAIT allocation context here and any {in}direct dependency on
> it. 

AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
__GFP_DIRECT_RECLAIM..

This matches the existing test in __need_fs_reclaim() - so if you are
OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
allocations during OOM, then I think fs_reclaim already matches what
you described?

> Whether fs_reclaim_acquire can be reused for that I do not know
> because I am not familiar with the lockdep machinery enough

Well, if fs_reclaim is not already testing the flags you want, then we
could add another lockdep map that does. The basic principle is the
same, if you want to detect and prevent recursion into the allocator
under certain GFP flags then AFAIK lockdep is the best tool we
have.

> No, non-blocking is a very coarse approximation of what we really need.
> But it should give us an even stronger condition. Essentially any sleep
> other than a preemption shouldn't be allowed in that context.

But it is a nonsense API to give the driver invalidate_range_start,
the blocking alternative to the non-blocking invalidate_range and then
demand it to be non-blocking.

Inspecting the code, no drivers are actually able to progress their
side in non-blocking mode.

The best we got was drivers tested the VA range and returned success
if they had no interest. Which is a big win to be sure, but it looks
like getting any more is not really possible.

However, we could (probably even should) make the drivers fs_reclaim
safe.

If that is enough to guarantee progress of OOM, then let's consider
something like using current_gfp_context() to force PF_MEMALLOC_NOFS
allocation behavior on the driver callback and lockdep to try and keep
pushing on the debugging, and dropping !blocking.

Jason
Jason Gunthorpe Aug. 15, 2019, 2:37 p.m. UTC | #12
On Thu, Aug 15, 2019 at 03:12:11PM +0200, Daniel Vetter wrote:
> On Thu, Aug 15, 2019 at 3:04 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote:
> >
> > > As the oom reaper is the primary guarantee of the oom handling forward
> > > progress it cannot be blocked on anything that might depend on blockable
> > > memory allocations. These are not really easy to track because they
> > > might be indirect - e.g. notifier blocks on a lock which other context
> > > holds while allocating memory or waiting for a flusher that needs memory
> > > to perform its work.
> >
> > But lockdep *does* track all this and fs_reclaim_acquire() was created
> > to solve exactly this problem.
> >
> > fs_reclaim is a lock and it flows through all the usual lockdep
> > schemes like any other lock. Any time the page allocator wants to do
> > something that would deadlock with reclaim it takes the lock.
> >
> > Failure is expressed by a deadlock cycle in the lockdep map, and
> > lockdep can handle arbitrary complexity through layers of locks, work
> > queues, threads, etc.
> >
> > What is missing?
> 
> Lockdep doesn't see everything by far. E.g. a wait_event will be
> caught by the annotations here, but not by lockdep. 

Sure, but the wait_event might be OK if its progress isn't contingent
on fs_reclaim, ie triggered from interrupt, so why ban it?

> And since we're talking about mmu notifiers here and gpus/dma engines.
> We have dma_fence_wait, which can wait for any hw/driver in the system
> that takes part in shared/zero-copy buffer processing. Which at least
> on the graphics side is everything. This pulls in enormous amounts of
> deadlock potential that lockdep simply is blind about and will never
> see.

It seems very risky to entangle a notifier widely like that.

It looks pretty sure that notifiers are fs_reclaim, so at a minimum
that wait_event can't be contingent on anything that is doing
GFP_KERNEL or it will deadlock - and blockable doesn't make that sleep
safe.

Avoiding an uncertain wait_event under notifiers would seem to be the
only reasonable design here..

Jason
Daniel Vetter Aug. 15, 2019, 2:43 p.m. UTC | #13
On Thu, Aug 15, 2019 at 4:38 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Aug 15, 2019 at 03:12:11PM +0200, Daniel Vetter wrote:
> > On Thu, Aug 15, 2019 at 3:04 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote:
> > >
> > > > As the oom reaper is the primary guarantee of the oom handling forward
> > > > progress it cannot be blocked on anything that might depend on blockable
> > > > memory allocations. These are not really easy to track because they
> > > > might be indirect - e.g. notifier blocks on a lock which other context
> > > > holds while allocating memory or waiting for a flusher that needs memory
> > > > to perform its work.
> > >
> > > But lockdep *does* track all this and fs_reclaim_acquire() was created
> > > to solve exactly this problem.
> > >
> > > fs_reclaim is a lock and it flows through all the usual lockdep
> > > schemes like any other lock. Any time the page allocator wants to do
> > > something that would deadlock with reclaim it takes the lock.
> > >
> > > Failure is expressed by a deadlock cycle in the lockdep map, and
> > > lockdep can handle arbitrary complexity through layers of locks, work
> > > queues, threads, etc.
> > >
> > > What is missing?
> >
> > Lockdep doesn't see everything by far. E.g. a wait_event will be
> > caught by the annotations here, but not by lockdep.
>
> Sure, but the wait_event might be OK if its progress isn't contingent
> on fs_reclaim, ie triggered from interrupt, so why ban it?

For normal notifiers sure (but lockdep won't help you prove that at
all). For oom_reaper apparently not, but that's really Michal's thing,
I don't have much idea about that.

> > And since we're talking about mmu notifiers here and gpus/dma engines.
> > We have dma_fence_wait, which can wait for any hw/driver in the system
> > that takes part in shared/zero-copy buffer processing. Which at least
> > on the graphics side is everything. This pulls in enormous amounts of
> > deadlock potential that lockdep simply is blind about and will never
> > see.
>
> It seems very risky to entangle a notifier widely like that.

That's why I've looked into all possible ways to annotate them with
debug infrastructure.

> It looks pretty sure that notifiers are fs_reclaim, so at a minimum
> that wait_event can't be contingent on anything that is doing
> GFP_KERNEL or it will deadlock - and blockable doesn't make that sleep
> safe.
>
> Avoiding an uncertain wait_event under notifiers would seem to be the
> only reasonable design here..

You have to wait for the gpu to finish current processing in
invalidate_range_start. Otherwise there's no point to any of this
really. So the wait_event/dma_fence_wait are unavoidable really.

That's also why I'm throwing in the lockdep annotation on top, and why
it would be really nice if we somehow can get the cross-release work
landed. But it looks like all the people who could make it happen are
busy with smeltdown :-/
-Daniel
Jason Gunthorpe Aug. 15, 2019, 3:10 p.m. UTC | #14
On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:

> You have to wait for the gpu to finish current processing in
> invalidate_range_start. Otherwise there's no point to any of this
> really. So the wait_event/dma_fence_wait are unavoidable really.

I don't envy your task :|

But, what you describe sure sounds like a 'registration cache' model,
not the 'shadow pte' model of coherency.

The key difference is that a registration cache is allowed to become
incoherent with the VMA's because it holds page pins. It is a
programming bug in userspace to change VA mappings via mmap/munmap/etc
while the device is working on that VA, but it does not harm system
integrity because of the page pin.

The cache ensures that each initiated operation sees a DMA setup that
matches the current VA map when the operation is initiated and allows
expensive device DMA setups to be re-used.

A 'shadow pte' model (ie hmm) *really* needs device support to
directly block DMA access - ie trigger 'device page fault'. ie the
invalidate_start should inform the device to enter a fault mode and
that is it.  If the device can't do that, then the driver probably
shouldn't pursue this level of coherency. The driver would quickly get
into the messy locking problems like dma_fence_wait from a notifier.

It is important to identify what model you are going for as defining a
'registration cache' coherence expectation allows the driver to skip
blocking in invalidate_range_start. All it does is invalidate the
cache so that future operations pick up the new VA mapping.
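
Under that model the notifier can be a short, never-blocking cache
shoot-down - roughly like this sketch (struct my_reg_cache and
my_cache_drop_entries() are hypothetical, assuming the current
mmu_notifier_range-based callback signature):

  static int my_invalidate_range_start(struct mmu_notifier *mn,
                          const struct mmu_notifier_range *range)
  {
          struct my_reg_cache *cache =
                  container_of(mn, struct my_reg_cache, mn);

          /* just drop the cached DMA setups overlapping the range;
           * page pins keep in-flight DMA safe, and future
           * operations re-register against the new VA map */
          spin_lock(&cache->lock);
          my_cache_drop_entries(cache, range->start, range->end);
          spin_unlock(&cache->lock);
          return 0;
  }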

Intel's HFI RDMA driver uses this model extensively, and I think it is
well proven, within some limitations of course.

At least, 'registration cache' is the only use model I know of where
it is acceptable to skip invalidate_range_end.

Jason
Michal Hocko Aug. 15, 2019, 4 p.m. UTC | #15
On Thu 15-08-19 11:12:19, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 03:21:27PM +0200, Michal Hocko wrote:
> > On Thu 15-08-19 09:23:44, Jason Gunthorpe wrote:
> > > On Thu, Aug 15, 2019 at 08:58:29AM +0200, Daniel Vetter wrote:
> > > > On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote:
> > > > > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote:
> > > > > > In some special cases we must not block, but there's not a
> > > > > > spinlock, preempt-off, irqs-off or similar critical section already
> > > > > > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > > > > > pair to annotate these.
> > > > > > 
> > > > > > This will be used in the oom paths of mmu-notifiers, where blocking is
> > > > > > not allowed to make sure there's forward progress. Quoting Michal:
> > > > > > 
> > > > > > "The notifier is called from quite a restricted context - oom_reaper -
> > > > > > which shouldn't depend on any locks or sleepable conditionals. The code
> > > > > > should be swift as well but we mostly do care about it to make a forward
> > > > > > progress. Checking for sleepable context is the best thing we could come
> > > > > > up with that would describe these demands at least partially."
> > > > > 
> > > > > But this describes fs_reclaim_acquire() - is there some reason we are
> > > > > conflating fs_reclaim with non-sleeping?
> > > > 
> > > > No idea why you tie this into fs_reclaim. We can definitely sleep in there,
> > > > and for e.g. kswapd (which also wraps everything in fs_reclaim) we're
> > > > even supposed to, I thought. To make sure we can get at the last bit of
> > > > memory by flushing all the queues and waiting for everything to be cleaned
> > > > out.
> > > 
> > > AFAIK the point of fs_reclaim is to prevent "indirect dependency upon
> > > the page allocator" ie the justification that was given for this
> > > !blockable stuff.
> > > 
> > > For instance:
> > > 
> > >   fs_reclaim_acquire()
> > >   kmalloc(GFP_KERNEL) <- lock dep assertion
> > > 
> > > And further, Michal's concern about indirectness through locks is also
> > > handled by lockdep:
> > > 
> > >        CPU0                                 CPU1
> > >                                         mutex_lock()
> > >                                         kmalloc(GFP_KERNEL)
> > >                                         mutex_unlock()
> > >   fs_reclaim_acquire()
> > >   mutex_lock() <- lock dep assertion
> > > 
> > > In other words, to prevent recursion into the page allocator you use
> > > fs_reclaim_acquire(), and lockdep verifies it in its usual robust way.
> > 
> > fs_reclaim_acquire is about FS/IO recursions IIUC. We are talking about
> > any !GFP_NOWAIT allocation context here and any {in}direct dependency on
> > it. 
> 
> AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> __GFP_DIRECT_RECLAIM..
>
> This matches the existing test in __need_fs_reclaim() - so if you are
> OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> allocations during OOM, then I think fs_reclaim already matches what
> you described?

No GFP_NOFS is equally bad. Please read my other email explaining what
the oom_reaper actually requires. In short no blocking on direct or
indirect dependency on memory allocation that might sleep. If you can
express that in the existing lockdep machinery. All fine. But then
consider deployments where lockdep is a no-no because of the overhead.

> > No, non-blocking is a very coarse approximation of what we really need.
> > But it should give us an even stronger condition. Essentially any sleep
> > other than a preemption shouldn't be allowed in that context.
> 
> But it is a nonsense API to give the driver invalidate_range_start,
> the blocking alternative to the non-blocking invalidate_range and then
> demand it to be non-blocking.
> 
> Inspecting the code, no drivers are actually able to progress their
> side in non-blocking mode.
> 
> The best we got was drivers tested the VA range and returned success
> if they had no interest. Which is a big win to be sure, but it looks
> like getting any more is not really possible.

And that is already a great win! Because many notifiers only do care
about particular mappings. Please note that backing off unconditionally
will simply mean that the oom reaper has to back off and not tear
down anything.

> However, we could (probably even should) make the drivers fs_reclaim
> safe.
> 
> If that is enough to guarantee progress of OOM, then let's consider
> something like using current_gfp_context() to force PF_MEMALLOC_NOFS
> allocation behavior on the driver callback and lockdep to try and keep
> pushing on the debugging, and dropping !blocking.

How are you going to enforce an indirect dependency? E.g. a lock that is
also used in another context which depends on sleepable memory allocation
to move forward.

Really, this was aimed to give a very simple debugging aid. If it is
considered to be so controversial, even though I really do not see why,
then let's just drop it on the floor.
Daniel Vetter Aug. 15, 2019, 4:25 p.m. UTC | #16
On Thu, Aug 15, 2019 at 5:10 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
>
> > You have to wait for the gpu to finish current processing in
> > invalidate_range_start. Otherwise there's no point to any of this
> > really. So the wait_event/dma_fence_wait are unavoidable really.
>
> I don't envy your task :|
>
> But, what you describe sure sounds like a 'registration cache' model,
> not the 'shadow pte' model of coherency.
>
> The key difference is that a registration cache is allowed to become
> incoherent with the VMA's because it holds page pins. It is a
> programming bug in userspace to change VA mappings via mmap/munmap/etc
> while the device is working on that VA, but it does not harm system
> integrity because of the page pin.
>
> The cache ensures that each initiated operation sees a DMA setup that
> matches the current VA map when the operation is initiated and allows
> expensive device DMA setups to be re-used.
>
> A 'shadow pte' model (ie hmm) *really* needs device support to
> directly block DMA access - ie trigger 'device page fault'. ie the
> invalidate_start should inform the device to enter a fault mode and
> that is it.  If the device can't do that, then the driver probably
> shouldn't pursue this level of coherency. The driver would quickly get
> into the messy locking problems like dma_fence_wait from a notifier.
>
> It is important to identify what model you are going for as defining a
> 'registration cache' coherence expectation allows the driver to skip
> blocking in invalidate_range_start. All it does is invalidate the
> cache so that future operations pick up the new VA mapping.
>
> Intel's HFI RDMA driver uses this model extensively, and I think it is
> well proven, within some limitations of course.
>
> At least, 'registration cache' is the only use model I know of where
> it is acceptable to skip invalidate_range_end.

I'm not really well versed in the details of our userptr, but both
amdgpu and i915 wait for the gpu to complete from
invalidate_range_start. Jerome has at least looked a lot at the amdgpu
one, so maybe he can explain what exactly it is we're doing ...
-Daniel
Jerome Glisse Aug. 15, 2019, 4:32 p.m. UTC | #17
On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
> 
> > You have to wait for the gpu to finish current processing in
> > invalidate_range_start. Otherwise there's no point to any of this
> > really. So the wait_event/dma_fence_wait are unavoidable really.
> 
> I don't envy your task :|
> 
> But, what you describe sure sounds like a 'registration cache' model,
> not the 'shadow pte' model of coherency.
> 
> The key difference is that a registration cache is allowed to become
> incoherent with the VMA's because it holds page pins. It is a
> programming bug in userspace to change VA mappings via mmap/munmap/etc
> while the device is working on that VA, but it does not harm system
> integrity because of the page pin.
> 
> The cache ensures that each initiated operation sees a DMA setup that
> matches the current VA map when the operation is initiated and allows
> expensive device DMA setups to be re-used.
> 
> A 'shadow pte' model (ie hmm) *really* needs device support to
> directly block DMA access - ie trigger 'device page fault'. ie the
> invalidate_start should inform the device to enter a fault mode and
> that is it.  If the device can't do that, then the driver probably
> shouldn't pursue this level of coherency. The driver would quickly get
> into the messy locking problems like dma_fence_wait from a notifier.

I think here we do not agree on the hardware requirement. For GPU
we will always need to be able to wait for some GPU fence from inside
the notifier callback, there is just no way around that for many of
the GPUs today (I do not see any indication of that changing).

Drivers should avoid lock complexity by using a wait queue so that the
driver notifier callback can wait without having to hold some driver
lock. However there will be at least one lock needed to update the
internal driver state for the range being invalidated. That lock is
just the device driver page table lock for the GPU page table
associated with the mm_struct. In all GPU drivers so far it is a
short-lived lock and nothing blocking is done while holding it (it is
just about updating the page table directory really, whether it is
filling it or clearing it).
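
A hedged sketch of that pattern (struct gpu_mirror and the gpu_*
helpers are hypothetical; mmu_notifier_range_blockable() is the real
helper for the oom/!blockable case):

  static int gpu_invalidate_range_start(struct mmu_notifier *mn,
                          const struct mmu_notifier_range *range)
  {
          struct gpu_mirror *gm =
                  container_of(mn, struct gpu_mirror, mn);

          if (!mmu_notifier_range_blockable(range))
                  return -EAGAIN;  /* oom path: back off, don't wait */

          /* wait for in-flight GPU work touching the range, without
           * holding any driver lock - the wait queue pattern above */
          wait_event(gm->idle_wq, gpu_range_idle(gm, range));

          /* the one short-lived lock: update the GPU page table
           * directory; nothing blocking is done while holding it */
          spin_lock(&gm->pt_lock);
          gpu_clear_ptes(gm, range->start, range->end);
          spin_unlock(&gm->pt_lock);
          return 0;
  }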


> 
> It is important to identify what model you are going for as defining a
> 'registration cache' coherence expectation allows the driver to skip
> blocking in invalidate_range_start. All it does is invalidate the
> cache so that future operations pick up the new VA mapping.
> 
> Intel's HFI RDMA driver uses this model extensively, and I think it is
> well proven, within some limitations of course.
> 
> At least, 'registration cache' is the only use model I know of where
> it is acceptable to skip invalidate_range_end.

Here GPUs are not in the registration cache model; I know it might look
like it because of GUP, but GUP was used just because hmm did not exist
at the time.

Cheers,
Jérôme
Jason Gunthorpe Aug. 15, 2019, 4:56 p.m. UTC | #18
On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:

> > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > __GFP_DIRECT_RECLAIM..
> >
> > This matches the existing test in __need_fs_reclaim() - so if you are
> > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > allocations during OOM, then I think fs_reclaim already matches what
> > you described?
> 
> No GFP_NOFS is equally bad. Please read my other email explaining what
> the oom_reaper actually requires. In short no blocking on direct or
> indirect dependency on memory allocation that might sleep.

It is much easier to follow with some hints on code, so the true
requirement is that the OOM reaper not block on GFP_FS and GFP_IO
allocations, great, that constraint is now clear.

> If you can express that in the existing lockdep machinery. All
> fine. But then consider deployments where lockdep is a no-no because
> of the overhead.

This is all for driver debugging. The point of lockdep is to find all
these paths without having to hit them as actual races, using debug
kernels.

I don't think we need this kind of debugging on production kernels?

> > The best we got was drivers tested the VA range and returned success
> > if they had no interest. Which is a big win to be sure, but it looks
> > like getting any more is not really possible.
> 
> And that is already a great win! Because many notifiers only do care
> about particular mappings. Please note that backing off unconditionally
> will simply mean that the oom reaper has to back off and not tear
> down anything.

Well, I'm working to propose that we do the VA range test under core
mmu notifier code that cannot block and then we simply remove the idea
of blockable from drivers using this new 'range notifier'. 

I think this pretty much solves the concern?

> > However, we could (probably even should) make the drivers fs_reclaim
> > safe.
> > 
> > If that is enough to guarantee progress of OOM, then let's consider
> > something like using current_gfp_context() to force PF_MEMALLOC_NOFS
> > allocation behavior on the driver callback and lockdep to try and keep
> > pushing on the debugging, and dropping !blocking.
> 
> How are you going to enforce an indirect dependency? E.g. a lock that is
> also used in another context which depends on sleepable memory allocation
> to move forward.

You mean like this:

       CPU0                                 CPU1
                                        mutex_lock()
                                        kmalloc(GFP_KERNEL)
                                        mutex_unlock()
  fs_reclaim_acquire()
  mutex_lock() <- illegal: lock dep assertion

?

lockdep handles this - that is what it does, it builds a graph of all
lock dependencies and requires the graph to be acyclic. So mutex_lock
depends on fs_reclaim_lock on CPU1 while on CPU0, fs_reclaim_lock
depends on mutex_lock. This is an ABBA locking cycle and lockdep will
reliably trigger.

So, if we wanted to do this, I'd probably suggest we retool
fs_reclaim's interface to be more like

  prevent_gfp_flags(__GFP_IO | __GFP_FS);
  [..]
  restore_gfp_flags(__GFP_IO | __GFP_FS);

Which is lockdep based and follows the indirect lock dependencies you
talked about.
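
To be clear, prevent_gfp_flags()/restore_gfp_flags() would be new - a
rough sketch of how they could sit on per-flag lockdep maps, the same
way fs_reclaim does today:

  /* sketch of the proposed interface; the maps would be set up
   * with lockdep_init_map() at boot (omitted here) */
  static struct lockdep_map gfp_io_map, gfp_fs_map;

  void prevent_gfp_flags(gfp_t flags)
  {
          if (flags & __GFP_IO)
                  lock_map_acquire(&gfp_io_map);
          if (flags & __GFP_FS)
                  lock_map_acquire(&gfp_fs_map);
  }

  void restore_gfp_flags(gfp_t flags)
  {
          if (flags & __GFP_FS)
                  lock_map_release(&gfp_fs_map);
          if (flags & __GFP_IO)
                  lock_map_release(&gfp_io_map);
  }

  /* the allocator would do the matching acquire/release when it
   * may recurse into reclaim with those flags set, just like
   * fs_reclaim_acquire() does now */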

Then OOM and reclaim can specify the different flags they want
blocked.  We could probably use the same API with WQ_MEM_RECLAIM as I
was chatting with Tejun about:

https://www.spinics.net/lists/linux-rdma/msg77362.html

IMHO this stuff is *super complicated* for those of us outside the mm
subsystem, so having some really clear annotations like the above
would go a long way to help understand these special constraints.

I'm pretty sure there are lots of driver bugs related to using the
wrong GFP flags in the kernel.

> Really, this was aimed to give a very simple debugging aid. If it is
> considered to be so controversial, even though I really do not see why,
> then let's just drop it on the floor.

My concern is that the requirement was very unclear. I'm trying to
understand all the bits of how these notifiers work, and the exact
semantics of this OOM path have been vague - is it really some GFP
flag restriction, or truly related to not sleeping?

Jason
Jerome Glisse Aug. 15, 2019, 5:11 p.m. UTC | #19
On Thu, Aug 15, 2019 at 01:56:31PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> 
> > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > __GFP_DIRECT_RECLAIM..
> > >
> > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > allocations during OOM, then I think fs_reclaim already matches what
> > > you described?
> > 
> > No GFP_NOFS is equally bad. Please read my other email explaining what
> > the oom_reaper actually requires. In short no blocking on direct or
> > indirect dependency on memory allocation that might sleep.
> 
> It is much easier to follow with some hints on code, so the true
> requirement is that the OOM reaper not block on GFP_FS and GFP_IO
> allocations, great, that constraint is now clear.
> 
> > If you can express that in the existing lockdep machinery. All
> > fine. But then consider deployments where lockdep is a no-no because
> > of the overhead.
> 
> This is all for driver debugging. The point of lockdep is to find all
> these paths without having to hit them as actual races, using debug
> kernels.
> 
> I don't think we need this kind of debugging on production kernels?
> 
> > > The best we got was drivers tested the VA range and returned success
> > > if they had no interest. Which is a big win to be sure, but it looks
> > > like getting any more is not really possible.
> > 
> > And that is already a great win! Because many notifiers only do care
> > about particular mappings. Please note that backing off unconditionally
> > will simply mean that the oom reaper has to back off and not tear
> > down anything.
> 
> Well, I'm working to propose that we do the VA range test under core
> mmu notifier code that cannot block and then we simply remove the idea
> of blockable from drivers using this new 'range notifier'. 
> 
> I think this pretty much solves the concern?

I am not sure I follow what you propose here. Like I pointed out in
another email, for GPU we do need to be able to sleep (we might get
lucky and not need to, but this is a runtime thing) within the notifier
range_start callback. This is something notifiers have allowed since
they were introduced in the kernel.

Cheers,
Jérôme
Jason Gunthorpe Aug. 15, 2019, 5:16 p.m. UTC | #20
On Thu, Aug 15, 2019 at 12:32:38PM -0400, Jerome Glisse wrote:
> On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
> > 
> > > You have to wait for the gpu to finish current processing in
> > > invalidate_range_start. Otherwise there's no point to any of this
> > > really. So the wait_event/dma_fence_wait are unavoidable really.
> > 
> > I don't envy your task :|
> > 
> > But, what you describe sure sounds like a 'registration cache' model,
> > not the 'shadow pte' model of coherency.
> > 
> > The key difference is that a registration cache is allowed to become
> > incoherent with the VMA's because it holds page pins. It is a
> > programming bug in userspace to change VA mappings via mmap/munmap/etc
> > while the device is working on that VA, but it does not harm system
> > integrity because of the page pin.
> > 
> > The cache ensures that each initiated operation sees a DMA setup that
> > matches the current VA map when the operation is initiated and allows
> > expensive device DMA setups to be re-used.
> > 
> > A 'shadow pte' model (ie hmm) *really* needs device support to
> > directly block DMA access - ie trigger 'device page fault'. ie the
> > invalidate_start should inform the device to enter a fault mode and
> > that is it.  If the device can't do that, then the driver probably
> > shouldn't pursue this level of coherency. The driver would quickly get
> > into the messy locking problems like dma_fence_wait from a notifier.
> 
> I think here we do not agree on the hardware requirement. For GPU
> we will always need to be able to wait for some GPU fence from inside
> the notifier callback, there is just no way around that for many of
> the GPUs today (I do not see any indication of that changing).

I didn't say you couldn't wait, I was trying to say that the wait
should only be contingent on the HW itself. Ie you can wait on a GPU
page table lock, and you can wait on a GPU page table flush completion
via IRQ.

What is troubling is to wait till some other thread gets a GPU command
completion and decrements a kref on the DMA buffer - which kinda looks
like what this dma_fence() stuff is all about. A driver like that
would have to be super careful to ensure consistent forward progress
toward dma ref == 0 when the system is under reclaim. 

ie by running its entire IRQ flow under fs_reclaim locking.
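
(For illustration, a minimal sketch of that pattern; the driver, handler
and lock are invented here, fs_reclaim_acquire()/fs_reclaim_release()
are the existing core-mm lockdep annotations:)

    static irqreturn_t my_drv_irq_thread(int irq, void *data)
    {
            struct my_drv *drv = data;

            /* teach lockdep that everything taken in the completion
             * path may also be taken from reclaim context */
            fs_reclaim_acquire(GFP_KERNEL);
            spin_lock(&drv->fence_lock);
            /* signal fences, drop DMA buffer krefs toward zero */
            spin_unlock(&drv->fence_lock);
            fs_reclaim_release(GFP_KERNEL);

            return IRQ_HANDLED;
    }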

> associated with the mm_struct. In all GPU drivers so far it is a
> short-lived lock and nothing blocking is done while holding it (it is
> just about updating the page table directory really, whether it is
> filling it or clearing it).

The main blocking I expect in a shadow PTE flow is waiting for the HW
to complete invalidations of its PTE cache.

> > It is important to identify what model you are going for as defining a
> > 'registration cache' coherence expectation allows the driver to skip
> > blocking in invalidate_range_start. All it does is invalidate the
> > cache so that future operations pick up the new VA mapping.
> > 
> > Intel's HFI RDMA driver uses this model extensively, and I think it is
> > well proven, within some limitations of course.
> > 
> > At least, 'registration cache' is the only use model I know of where
> > it is acceptable to skip invalidate_range_end.
> 
> Here GPUs are not in the registration cache model; I know it might look
> like it because of GUP, but GUP was used just because hmm did not exist
> at the time.

It is not because of GUP, it is because of the lack of
invalidate_range_end. A driver cannot correctly implement the SPTE
model without invalidate_range_end, even if it holds the page pins via
GUP.

So, I've been assuming the few drivers without invalidate_range_end
are trying to do registration caching, rather than assuming they are
broken.
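
For reference, the shape of the thing (the ops struct and hooks are the
real mmu_notifier API, the callback names are invented):

    static const struct mmu_notifier_ops my_spte_ops = {
            /* tell HW to fault / flush its PTE cache; must be paired
             * with _end so the mirror knows when it may repopulate */
            .invalidate_range_start = my_invalidate_start,
            .invalidate_range_end   = my_invalidate_end,
    };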

Jason
Jason Gunthorpe Aug. 15, 2019, 5:17 p.m. UTC | #21
On Thu, Aug 15, 2019 at 01:11:56PM -0400, Jerome Glisse wrote:
> On Thu, Aug 15, 2019 at 01:56:31PM -0300, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> > 
> > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > > __GFP_DIRECT_RECLAIM..
> > > >
> > > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > > allocations during OOM, then I think fs_reclaim already matches what
> > > > you described?
> > > 
> > > No GFP_NOFS is equally bad. Please read my other email explaining what
> > > the oom_reaper actually requires. In short no blocking on direct or
> > > indirect dependency on memory allocation that might sleep.
> > 
> > It is much easier to follow with some hints on code, so the true
> > requirement is that the OOM reaper not block on GFP_FS and GFP_IO
> > allocations, great, that constraint is now clear.
> > 
> > > If you can express that in the existing lockdep machinery. All
> > > fine. But then consider deployments where lockdep is no-no because
> > > of the overhead.
> > 
> > This is all for driver debugging. The point of lockdep is to find all
> > these paths without having to hit them as actual races, using debug
> > kernels.
> > 
> > I don't think we need this kind of debugging on production kernels?
> > 
> > > > The best we got was drivers tested the VA range and returned success
> > > > if they had no interest. Which is a big win to be sure, but it looks
> > > > like getting any more is not really possible.
> > > 
> > > And that is already a great win! Because many notifiers only do care
> > > about particular mappings. Please note that backing off unconditionally
> > > will simply cause the oom reaper to back off without tearing
> > > anything down.
> > 
> > Well, I'm working to propose that we do the VA range test under core
> > mmu notifier code that cannot block and then we simply remove the idea
> > of blockable from drivers using this new 'range notifier'. 
> > 
> > I think this pretty much solves the concern?
> 
> I am not sure I follow what you propose here? Like I pointed out in
> another email, for GPU we do need to be able to sleep (we might get
> lucky and not need to, but that is a runtime thing) within the notifier
> range_start callback. This has been allowed by the notifier since it
> was introduced in the kernel.

Sorry, I mean remove the idea of the blockable flag from the
drivers. Drivers will always be able to block, within the existing
limitations of fs_reclaim.

Jason
Daniel Vetter Aug. 15, 2019, 5:21 p.m. UTC | #22
On Thu, Aug 15, 2019 at 7:16 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Aug 15, 2019 at 12:32:38PM -0400, Jerome Glisse wrote:
> > On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
> > >
> > > > You have to wait for the gpu to finish current processing in
> > > > invalidate_range_start. Otherwise there's no point to any of this
> > > > really. So the wait_event/dma_fence_wait are unavoidable really.
> > >
> > > I don't envy your task :|
> > >
> > > But, what you describe sure sounds like a 'registration cache' model,
> > > not the 'shadow pte' model of coherency.
> > >
> > > The key difference is that a registration cache is allowed to become
> > > incoherent with the VMA's because it holds page pins. It is a
> > > programming bug in userspace to change VA mappings via mmap/munmap/etc
> > > while the device is working on that VA, but it does not harm system
> > > integrity because of the page pin.
> > >
> > > The cache ensures that each initiated operation sees a DMA setup that
> > > matches the current VA map when the operation is initiated and allows
> > > expensive device DMA setups to be re-used.
> > >
> > > A 'shadow pte' model (ie hmm) *really* needs device support to
> > > directly block DMA access - ie trigger 'device page fault'. ie the
> > > invalidate_start should inform the device to enter a fault mode and
> > > that is it.  If the device can't do that, then the driver probably
> > > shouldn't pursue this level of coherency. The driver would quickly get
> > > into the messy locking problems like dma_fence_wait from a notifier.
> >
> > I think here we do not agree on the hardware requirement. For GPU
> > we will always need to be able to wait for some GPU fence from inside
> > the notifier callback, there is just no way around that for many of
> > the GPUs today (i do not see any indication of that changing).
>
> I didn't say you couldn't wait, I was trying to say that the wait
> should only be contingent on the HW itself. Ie you can wait on a GPU
> page table lock, and you can wait on a GPU page table flush completion
> via IRQ.
>
> What is troubling is to wait till some other thread gets a GPU command
> completion and decrements a kref on the DMA buffer - which kinda looks
> like what this dma_fence() stuff is all about. A driver like that
> would have to be super careful to ensure consistent forward progress
> toward dma ref == 0 when the system is under reclaim.
>
> ie by running its entire IRQ flow under fs_reclaim locking.

This is correct. At least for i915 it's already required due to our
shrinker also having to do the same. I think amdgpu isn't bothering
with that since they have vram for most of the stuff, and just limit
system memory usage to half of the total and forgo the shrinker. Probably
not the nicest approach. Anyway, both do the same mmu_notifier dance; I
just want to explain that we've been living with this for a while
already.

So yeah writing a gpu driver is not easy.

> > associated with the mm_struct. In all GPU drivers so far it is a
> > short-lived lock and nothing blocking is done while holding it (it is
> > just about updating the page table directory really, whether it is
> > filling it or clearing it).
>
> The main blocking I expect in a shadow PTE flow is waiting for the HW
> to complete invalidations of its PTE cache.
>
> > > It is important to identify what model you are going for as defining a
> > > 'registration cache' coherence expectation allows the driver to skip
> > > blocking in invalidate_range_start. All it does is invalidate the
> > > cache so that future operations pick up the new VA mapping.
> > >
> > > Intel's HFI RDMA driver uses this model extensively, and I think it is
> > > well proven, within some limitations of course.
> > >
> > > At least, 'registration cache' is the only use model I know of where
> > > it is acceptable to skip invalidate_range_end.
> >
> > Here GPUs are not in the registration cache model; I know it might look
> > like it because of GUP, but GUP was used just because hmm did not exist
> > at the time.
>
> It is not because of GUP, it is because of the lack of
> invalidate_range_end. A driver cannot correctly implement the SPTE
> model without invalidate_range_end, even if it holds the page pins via
> GUP.
>
> So, I've been assuming the few drivers without invalidate_range_end
> are trying to do registration caching, rather than assuming they are
> broken.

i915 might just be broken. amdgpu does the full thing, using
hmm_mirror. But still with dma_fence_wait.
-Daniel
Jerome Glisse Aug. 15, 2019, 5:35 p.m. UTC | #23
On Thu, Aug 15, 2019 at 07:21:47PM +0200, Daniel Vetter wrote:
> On Thu, Aug 15, 2019 at 7:16 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Thu, Aug 15, 2019 at 12:32:38PM -0400, Jerome Glisse wrote:
> > > On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
> > > >
> > > > > You have to wait for the gpu to finish current processing in
> > > > > invalidate_range_start. Otherwise there's no point to any of this
> > > > > really. So the wait_event/dma_fence_wait are unavoidable really.
> > > >
> > > > I don't envy your task :|
> > > >
> > > > But, what you describe sure sounds like a 'registration cache' model,
> > > > not the 'shadow pte' model of coherency.
> > > >
> > > > The key difference is that a registration cache is allowed to become
> > > > incoherent with the VMA's because it holds page pins. It is a
> > > > programming bug in userspace to change VA mappings via mmap/munmap/etc
> > > > while the device is working on that VA, but it does not harm system
> > > > integrity because of the page pin.
> > > >
> > > > The cache ensures that each initiated operation sees a DMA setup that
> > > > matches the current VA map when the operation is initiated and allows
> > > > expensive device DMA setups to be re-used.
> > > >
> > > > A 'shadow pte' model (ie hmm) *really* needs device support to
> > > > directly block DMA access - ie trigger 'device page fault'. ie the
> > > > invalidate_start should inform the device to enter a fault mode and
> > > > that is it.  If the device can't do that, then the driver probably
> > > > shouldn't pursue this level of coherency. The driver would quickly get
> > > > into the messy locking problems like dma_fence_wait from a notifier.
> > >
> > > I think here we do not agree on the hardware requirement. For GPU
> > > we will always need to be able to wait for some GPU fence from inside
> > > the notifier callback, there is just no way around that for many of
> > > the GPUs today (i do not see any indication of that changing).
> >
> > I didn't say you couldn't wait, I was trying to say that the wait
> > should only be contingent on the HW itself. Ie you can wait on a GPU
> > page table lock, and you can wait on a GPU page table flush completion
> > via IRQ.
> >
> > What is troubling is to wait till some other thread gets a GPU command
> > completion and decrements a kref on the DMA buffer - which kinda looks
> > like what this dma_fence() stuff is all about. A driver like that
> > would have to be super careful to ensure consistent forward progress
> > toward dma ref == 0 when the system is under reclaim.
> >
> > ie by running its entire IRQ flow under fs_reclaim locking.
> 
> This is correct. At least for i915 it's already required due to our
> shrinker also having to do the same. I think amdgpu isn't bothering
> with that since they have vram for most of the stuff, and just limit
> system memory usage to half of the total and forgo the shrinker. Probably
> not the nicest approach. Anyway, both do the same mmu_notifier dance; I
> just want to explain that we've been living with this for a while
> already.
> 
> So yeah writing a gpu driver is not easy.
> 
> > > associated with the mm_struct. In all GPU drivers so far it is a
> > > short-lived lock and nothing blocking is done while holding it (it is
> > > just about updating the page table directory really, whether it is
> > > filling it or clearing it).
> >
> > The main blocking I expect in a shadow PTE flow is waiting for the HW
> > to complete invalidations of its PTE cache.
> >
> > > > It is important to identify what model you are going for as defining a
> > > > 'registration cache' coherence expectation allows the driver to skip
> > > > blocking in invalidate_range_start. All it does is invalidate the
> > > > cache so that future operations pick up the new VA mapping.
> > > >
> > > > Intel's HFI RDMA driver uses this model extensively, and I think it is
> > > > well proven, within some limitations of course.
> > > >
> > > > At least, 'registration cache' is the only use model I know of where
> > > > it is acceptable to skip invalidate_range_end.
> > >
> > > Here GPUs are not in the registration cache model; I know it might look
> > > like it because of GUP, but GUP was used just because hmm did not exist
> > > at the time.
> >
> > It is not because of GUP, it is because of the lack of
> > invalidate_range_end. A driver cannot correctly implement the SPTE
> > model without invalidate_range_end, even if it holds the page pins via
> > GUP.
> >
> > So, I've been assuming the few drivers without invalidate_range_end
> > are trying to do registration caching, rather than assuming they are
> > broken.
> 
> i915 might just be broken. amdgpu does the full thing, using
> hmm_mirror. But still with dma_fence_wait.

Yeah, i915 is broken but it never hurt anyone ;) I posted a patch a
long time ago to convert it to hmm, but I delayed that until I can get
through making something of GUPfast that can also be used for HMM/ODP
users.

Cheers,
Jérôme
Jason Gunthorpe Aug. 15, 2019, 5:35 p.m. UTC | #24
On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote:

> I'm not really well versed in the details of our userptr, but both
> amdgpu and i915 wait for the gpu to complete from
> invalidate_range_start. Jerome has at least looked a lot at the amdgpu
> one, so maybe he can explain what exactly it is we're doing ...

amdgpu is (wrongly) using hmm for something, I can't really tell what
it is trying to do. The calls to dma_fence under the
invalidate_range_start do not give me a good feeling.

However, i915 shows all the signs of trying to follow the registration
cache model, it even has a nice comment in
i915_gem_userptr_get_pages() explaining that the races it has don't
matter because it is a user space bug to change the VA mapping in the
first place. That just screams registration cache to me.

So it is fine to run HW that way, but if you do, there is no reason to
fence inside the invalidate_range end. Just orphan the DMA buffer and
clean it up & release the page pins when all DMA buffer refs go to
zero. The next access to that VA should get a new DMA buffer with the
right mapping.

In other words the invalidation should be very simple without
complicated locking, or wait_event's. Look at hfi1 for example.

Jason
Jerome Glisse Aug. 15, 2019, 5:39 p.m. UTC | #25
On Thu, Aug 15, 2019 at 02:35:57PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote:
> 
> > I'm not really well versed in the details of our userptr, but both
> > amdgpu and i915 wait for the gpu to complete from
> > invalidate_range_start. Jerome has at least looked a lot at the amdgpu
> > one, so maybe he can explain what exactly it is we're doing ...
> 
> amdgpu is (wrongly) using hmm for something, I can't really tell what
> it is trying to do. The calls to dma_fence under the
> invalidate_range_start do not give me a good feeling.
> 
> However, i915 shows all the signs of trying to follow the registration
> cache model, it even has a nice comment in
> i915_gem_userptr_get_pages() explaining that the races it has don't
> matter because it is a user space bug to change the VA mapping in the
> first place. That just screams registration cache to me.
> 
> So it is fine to run HW that way, but if you do, there is no reason to
> fence inside the invalidate_range end. Just orphan the DMA buffer and
> clean it up & release the page pins when all DMA buffer refs go to
> zero. The next access to that VA should get a new DMA buffer with the
> right mapping.
> 
> In other words the invalidation should be very simple without
> complicated locking, or wait_event's. Look at hfi1 for example.

This would break today's usage model of uptr and it would break
userspace expectations, ie if the GPU is writing to that memory,
then userspace wants to make sure that it will see what the GPU
wrote.

Yes, i915 is broken in respect to not having an end notifier and
tracking active invalidations for a range, but the GUP side of things
kind of hides this bug and shrinks the window for bad things to happen
to something so small that I doubt anyone could ever hit it (still a
bug though).

Cheers,
Jérôme
Michal Hocko Aug. 15, 2019, 5:42 p.m. UTC | #26
On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> 
> > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > __GFP_DIRECT_RECLAIM..
> > >
> > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > allocations during OOM, then I think fs_reclaim already matches what
> > > you described?
> > 
> > No GFP_NOFS is equally bad. Please read my other email explaining what
> > the oom_reaper actually requires. In short no blocking on direct or
> > indirect dependency on memory allocation that might sleep.
> 
> It is much easier to follow with some hints on code, so the true
> requirement is that the OOM reaper not block on GFP_FS and GFP_IO
> allocations, great, that constraint is now clear.

I still do not get why you put FS/IO into the picture. This is really
about __GFP_DIRECT_RECLAIM.

> 
> > If you can express that in the existing lockdep machinery. All
> > fine. But then consider deployments where lockdep is no-no because
> > of the overhead.
> 
> This is all for driver debugging. The point of lockdep is to find all
> these paths without having to hit them as actual races, using debug
> kernels.
> 
> I don't think we need this kind of debugging on production kernels?

Again, the primary motivation was a simple debugging aid that could be
used without worrying about overhead. So lockdep is very often out of
the question.

> > > The best we got was drivers tested the VA range and returned success
> > > if they had no interest. Which is a big win to be sure, but it looks
> > > like getting any more is not really possible.
> > 
> > And that is already a great win! Because many notifiers only do care
> > about particular mappings. Please note that backing off unconditionally
> > will simply cause the oom reaper to back off without tearing
> > anything down.
> 
> Well, I'm working to propose that we do the VA range test under core
> mmu notifier code that cannot block and then we simply remove the idea
> of blockable from drivers using this new 'range notifier'. 
> 
> I think this pretty much solves the concern?

Well, my idea was that a range check and early bail out was a first step
and then each specific notifier would be able to do a more specific
check. I was not able to do the second step because that requires a deep
understanding of the respective subsystem.

Really all I do care about is to reclaim as much memory from the
oom_reaper context as possible. And that cannot really be an unbounded
process. Quite the contrary, it should be as swift as possible. From my
cursory look some notifiers are able to achieve their task without
blocking or depending on memory just fine. So bailing out
unconditionally on the range of interest would just put us back.
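
To illustrate the first step (the notifier hooks and range helpers are
the real API, the driver specifics are invented):

    static int my_invalidate_start(struct mmu_notifier *mn,
                                   const struct mmu_notifier_range *range)
    {
            struct my_mirror *m = container_of(mn, struct my_mirror, mn);

            /* Not our VA range: succeed without blocking so the oom
             * reaper can make forward progress. */
            if (range->end <= m->va_start || range->start >= m->va_end)
                    return 0;

            if (!mmu_notifier_range_blockable(range))
                    return -EAGAIN; /* the teardown below might sleep */

            /* ... subsystem specific, potentially blocking teardown ... */
            return 0;
    }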

> > > However, we could (probably even should) make the drivers fs_reclaim
> > > safe.
> > > 
> > > If that is enough to guarantee progress of OOM, then lets consider
> > > something like using current_gfp_context() to force PF_MEMALLOC_NOFS
> > > allocation behavior on the driver callback and lockdep to try and keep
> > > pushing on the debugging, and dropping !blocking.
> > 
> > How are you going to enforce indirect dependency? E.g. a lock that is
> > also used in another context which depends on sleepable memory allocation
> > to move forward.
> 
> You mean like this:
> 
>        CPU0                                 CPU1
>                                         mutex_lock()
>                                         kmalloc(GFP_KERNEL)

no I mean __GFP_DIRECT_RECLAIM here.

>                                         mutex_unlock()
>   fs_reclaim_acquire()
>   mutex_lock() <- illegal: lock dep assertion

I cannot really comment on how that is achievable by lockdep. I managed
to forget details about FS/IO reclaim recursion tracking already and I
do not have time to learn it again. It was quite a hack. Anyway, let me
repeat that the primary motivation was a simple aid. Not something as
powerful as lockdep.
Jerome Glisse Aug. 15, 2019, 5:57 p.m. UTC | #27
On Thu, Aug 15, 2019 at 07:42:07PM +0200, Michal Hocko wrote:
> On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> > 
> > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > > __GFP_DIRECT_RECLAIM..
> > > >
> > > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > > allocations during OOM, then I think fs_reclaim already matches what
> > > > you described?
> > > 
> > > No GFP_NOFS is equally bad. Please read my other email explaining what
> > > the oom_reaper actually requires. In short no blocking on direct or
> > > indirect dependency on memory allocation that might sleep.
> > 
> > It is much easier to follow with some hints on code, so the true
> > requirement is that the OOM reaper not block on GFP_FS and GFP_IO
> > allocations, great, that constraint is now clear.
> 
> I still do not get why you put FS/IO into the picture. This is really
> about __GFP_DIRECT_RECLAIM.
> 
> > 
> > > If you can express that in the existing lockdep machinery. All
> > > fine. But then consider deployments where lockdep is no-no because
> > > of the overhead.
> > 
> > This is all for driver debugging. The point of lockdep is to find all
> > these paths without having to hit them as actual races, using debug
> > kernels.
> > 
> > I don't think we need this kind of debugging on production kernels?
> 
> Again, the primary motivation was a simple debugging aid that could be
> used without worrying about overhead. So lockdep is very often out of
> the question.
> 
> > > > The best we got was drivers tested the VA range and returned success
> > > > if they had no interest. Which is a big win to be sure, but it looks
> > > > like getting any more is not really possible.
> > > 
> > > And that is already a great win! Because many notifiers only do care
> > > about particular mappings. Please note that backing off unconditionally
> > > will simply cause the oom reaper to back off without tearing
> > > anything down.
> > 
> > Well, I'm working to propose that we do the VA range test under core
> > mmu notifier code that cannot block and then we simply remove the idea
> > of blockable from drivers using this new 'range notifier'. 
> > 
> > I think this pretty much solves the concern?
> 
> Well, my idea was that a range check and early bail out was a first step
> and then each specific notifier would be able to do a more specific
> check. I was not able to do the second step because that requires a deep
> understanding of the respective subsystem.
> 
> Really all I do care about is to reclaim as much memory from the
> oom_reaper context as possible. And that cannot really be an unbounded
> process. Quite the contrary, it should be as swift as possible. From my
> cursory look some notifiers are able to achieve their task without
> blocking or depending on memory just fine. So bailing out
> unconditionally on the range of interest would just put us back.

Agreed, OOM is just asking the question "can I unmap that page
quickly?" so that it (OOM) can swap the page out. In many cases the
driver needs to look something up to see if, at the time, the memory is
just not in use and can be reclaimed right away. So the driver should
have a path to optimistically update its state to allow quick reclaim.


> > > > However, we could (probably even should) make the drivers fs_reclaim
> > > > safe.
> > > > 
> > > > If that is enough to guarantee progress of OOM, then lets consider
> > > > something like using current_gfp_context() to force PF_MEMALLOC_NOFS
> > > > allocation behavior on the driver callback and lockdep to try and keep
> > > > pushing on the debugging, and dropping !blocking.
> > > 
> > > How are you going to enforce indirect dependency? E.g. a lock that is
> > > also used in another context which depends on sleepable memory allocation
> > > to move forward.
> > 
> > You mean like this:
> > 
> >        CPU0                                 CPU1
> >                                         mutex_lock()
> >                                         kmalloc(GFP_KERNEL)
> 
> no I mean __GFP_DIRECT_RECLAIM here.
> 
> >                                         mutex_unlock()
> >   fs_reclaim_acquire()
> >   mutex_lock() <- illegal: lock dep assertion
> 
> I cannot really comment on how that is achievable by lockdep. I managed
> to forget details about FS/IO reclaim recursion tracking already and I
> do not have time to learn it again. It was quite a hack. Anyway, let me
> repeat that the primary motivation was a simple aid. Not something as
> powerful as lockdep.

I feel that fs_reclaim_acquire() is just too heavyweight here. I do
think that Daniel's patches improve the debugging situation without
burdening anything, so I am in favor of merging them.

I do not think we should devote too much time to fs_reclaim(); our
time would be better spent improving the driver shrinkers, for
instance. After all, OOM is all about trying to free up memory.

Cheers,
Jérôme
Jason Gunthorpe Aug. 15, 2019, 6:01 p.m. UTC | #28
On Thu, Aug 15, 2019 at 01:39:22PM -0400, Jerome Glisse wrote:
> On Thu, Aug 15, 2019 at 02:35:57PM -0300, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote:
> > 
> > > I'm not really well versed in the details of our userptr, but both
> > > amdgpu and i915 wait for the gpu to complete from
> > > invalidate_range_start. Jerome has at least looked a lot at the amdgpu
> > > one, so maybe he can explain what exactly it is we're doing ...
> > 
> > amdgpu is (wrongly) using hmm for something, I can't really tell what
> > it is trying to do. The calls to dma_fence under the
> > invalidate_range_start do not give me a good feeling.
> > 
> > However, i915 shows all the signs of trying to follow the registration
> > cache model, it even has a nice comment in
> > i915_gem_userptr_get_pages() explaining that the races it has don't
> > matter because it is a user space bug to change the VA mapping in the
> > first place. That just screams registration cache to me.
> > 
> > So it is fine to run HW that way, but if you do, there is no reason to
> > fence inside the invalidate_range end. Just orphan the DMA buffer and
> > clean it up & release the page pins when all DMA buffer refs go to
> > zero. The next access to that VA should get a new DMA buffer with the
> > right mapping.
> > 
> > In other words the invalidation should be very simple without
> > complicated locking, or wait_event's. Look at hfi1 for example.
> 
> This would break today's usage model of uptr and it would break
> userspace expectations, ie if the GPU is writing to that memory,
> then userspace wants to make sure that it will see what the GPU
> wrote.

How exactly? This is holding the page pin, so the only way the VA
mapping can be changed is via explicit user action.

ie:

   gpu_write_something(va, size)
   mmap(.., va, size, MAP_FIXED);
   gpu_wait_done()

This is racy and indeterminate with both models.

Based on the comment in i915 it appears to be following the model that
changes to the mmap by userspace while the GPU is working on it are a
programming bug. This is reasonable, lots of systems use this kind of
consistency model.

Since the driver seems to rely on a dma_fence to block DMA access, it
looks to me like the kernel has full visibility to the
'gpu_write_something()' and 'gpu_wait_done()' markers.

I think trying to use hmm_range_fault on HW that can't do HW page
faulting and HW 'TLB shootdown' is a very, very bad idea. I fear that
is what amd gpu is trying to do.

I haven't yet seen anything that looks like 'TLB shootdown' in i915??

Jason
Jason Gunthorpe Aug. 15, 2019, 6:24 p.m. UTC | #29
On Thu, Aug 15, 2019 at 07:42:07PM +0200, Michal Hocko wrote:
> On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> > 
> > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > > __GFP_DIRECT_RECLAIM..
> > > >
> > > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > > allocations during OOM, then I think fs_reclaim already matches what
> > > > you described?
> > > 
> > > No GFP_NOFS is equally bad. Please read my other email explaining what
> > > the oom_reaper actually requires. In short no blocking on direct or
> > > indirect dependency on memory allocation that might sleep.
> > 
> > It is much easier to follow with some hints on code, so the true
> > requirement is that the OOM reaper not block on GFP_FS and GFP_IO
> > allocations, great, that constraint is now clear.
> 
> I still do not get why you put FS/IO into the picture. This is really
> about __GFP_DIRECT_RECLAIM.

Like I said this is complicated, translating "no blocking on direct or
indirect dependency on memory allocation that might sleep" into GFP
flags is hard for us outside the mm community.

So the constraint here is no __GFP_DIRECT_RECLAIM?

I bring up FS/IO because that is what Tejun mentioned when I asked him
about reclaim restrictions, and is what fs_reclaim_acquire() is
already sensitive too. It is pretty confusing if we have places using
the word 'reclaim' with different restrictions. :(

> >        CPU0                                 CPU1
> >                                         mutex_lock()
> >                                         kmalloc(GFP_KERNEL)
> 
> no I mean __GFP_DIRECT_RECLAIM here.
> 
> >                                         mutex_unlock()
> >   fs_reclaim_acquire()
> >   mutex_lock() <- illegal: lock dep assertion
> 
> I cannot really comment on how that is achievable by lockdep.

??? I am trying to explain this is already done and working today. The
above example will already fault with lockdep enabled.

This is existing debugging we can use and improve upon rather than
inventing new debugging.
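
Eg a self-contained sketch of what lockdep already catches today
(demo_lock is invented):

    static DEFINE_MUTEX(demo_lock);

    static void cpu1_path(void)
    {
            /* records demo_lock -> fs_reclaim, because the sleeping
             * allocation takes the fs_reclaim map internally */
            mutex_lock(&demo_lock);
            kfree(kmalloc(64, GFP_KERNEL));
            mutex_unlock(&demo_lock);
    }

    static void cpu0_path(void)
    {
            /* records fs_reclaim -> demo_lock; lockdep complains
             * about the inversion without any actual race */
            fs_reclaim_acquire(GFP_KERNEL);
            mutex_lock(&demo_lock);
            mutex_unlock(&demo_lock);
            fs_reclaim_release(GFP_KERNEL);
    }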

Jason
Jerome Glisse Aug. 15, 2019, 6:27 p.m. UTC | #30
On Thu, Aug 15, 2019 at 03:01:59PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 01:39:22PM -0400, Jerome Glisse wrote:
> > On Thu, Aug 15, 2019 at 02:35:57PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote:
> > > 
> > > > I'm not really well versed in the details of our userptr, but both
> > > > amdgpu and i915 wait for the gpu to complete from
> > > > invalidate_range_start. Jerome has at least looked a lot at the amdgpu
> > > > one, so maybe he can explain what exactly it is we're doing ...
> > > 
> > > amdgpu is (wrongly) using hmm for something, I can't really tell what
> > > it is trying to do. The calls to dma_fence under the
> > > invalidate_range_start do not give me a good feeling.
> > > 
> > > However, i915 shows all the signs of trying to follow the registration
> > > cache model, it even has a nice comment in
> > > i915_gem_userptr_get_pages() explaining that the races it has don't
> > > matter because it is a user space bug to change the VA mapping in the
> > > first place. That just screams registration cache to me.
> > > 
> > > So it is fine to run HW that way, but if you do, there is no reason to
> > > fence inside the invalidate_range end. Just orphan the DMA buffer and
> > > clean it up & release the page pins when all DMA buffer refs go to
> > > zero. The next access to that VA should get a new DMA buffer with the
> > > right mapping.
> > > 
> > > In other words the invalidation should be very simple without
> > > complicated locking, or wait_event's. Look at hfi1 for example.
> > 
> > This would break today's usage model of uptr and it would break
> > userspace expectations, ie if the GPU is writing to that memory,
> > then userspace wants to make sure that it will see what the GPU
> > wrote.
> 
> How exactly? This is holding the page pin, so the only way the VA
> mapping can be changed is via explicit user action.
> 
> ie:
> 
>    gpu_write_something(va, size)
>    mmap(.., va, size, MAP_FIXED);
>    gpu_wait_done()
> 
> This is racy and indeterminate with both models.
> 
> Based on the comment in i915 it appears to be following the model that
> changes to the mmap by userspace while the GPU is working on it are a
> programming bug. This is reasonable, lots of systems use this kind of
> consistency model.

Well, a userspace process doing munmap(), mremap(), fork() and things
like that is a bug from the point of view of the i915 kernel/userspace
contract.

But things like migration or reclaim are not covered under that
contract, and for those the expectation is that CPU access to the same
virtual address should return what was last written to it, either by
the GPU or the CPU.

> 
> Since the driver seems to rely on a dma_fence to block DMA access, it
> looks to me like the kernel has full visibility to the
> 'gpu_write_something()' and 'gpu_wait_done()' markers.

So let's only consider the case where the GPU wants to write to the
memory (the read-only case is obviously correct and in fact does not
need any notifier) and, like above, the only thing we care about is
reclaim or migration (for instance because of some numa compaction),
as the rest is covered by the i915 userspace contract.

So in the write case we do GUPfast(write=true), which will be "safe"
from any concurrent CPU page table update, ie if GUPfast gets a
reference for the page then any racing CPU page table update will not
be able to migrate or reclaim the page, and thus the virtual address to
page association will be preserved.

This is only true because of GUPfast(); if GUPfast() fails it will fall
back to the slow GUP case, which makes the same thing safe by taking
the page table lock.
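
Roughly, the pinning step looks like this (sketch; error handling
elided, get_user_pages_fast() is the real API, everything else is
invented):

    struct page **pages;
    int pinned;

    pages = kvmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
    /* GUP-fast walks the page tables without taking mmap_sem;
     * internally it falls back to slow GUP (which takes the locks)
     * for anything it could not pin the fast way. */
    pinned = get_user_pages_fast(start, npages, FOLL_WRITE, pages);
    if (pinned < npages)
            unwind_partial_pin(pages, pinned); /* invented helper */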

Because of the reference on the page the i915 driver can forego the mmu
notifier end callback. The thing here is that taking a page reference
is pointless if we have better synchronization and tracking of mmu
notifiers. Hence converting to hmm mirror makes it possible to avoid
taking a ref on the page while still keeping the same functionality as
today.


> I think trying to use hmm_range_fault on HW that can't do HW page
> faulting and HW 'TLB shootdown' is a very, very bad idea. I fear that
> is what amd gpu is trying to do.
> 
> I haven't yet seen anything that looks like 'TLB shootdown' in i915??

GPU drivers have complex usage patterns and the tlb shootdown is
implicit: once the GEM object associated with the uptr is invalidated,
the next time userspace submits a command against that GEM object it
will have to be re-validated, which means re-programming the GPU page
table to point to the proper addresses (and re-calling GUP).

So the whole GPU page table update is all hidden behind GEM functions
like i915_gem_object_unbind() (or ttm invalidate for amd/radeon).

Technically those GPUs do not support page faulting, but because of the
way GPUs work you know that you have frequent checkpoints where you
can unbind things. This is what is happening here.

Also, not all GPU engines can handle page faults; this is true of all
GPUs with page faults AFAIK (AMD and NVidia so far). This is why
uptr is a limited form of SVM (shared virtual memory) that can be
used on all GPU engines including the dma engine. When using the
full SVM power (like hmm mirror with nouveau) it is only used on
GPU engines that support page faults (but updating the page table
still requires locking and waiting on an acknowledge).

Cheers,
Jérôme
Jason Gunthorpe Aug. 15, 2019, 6:57 p.m. UTC | #31
On Thu, Aug 15, 2019 at 02:27:19PM -0400, Jerome Glisse wrote:
> > How exactly? This is holding the page pin, so the only way the VA
> > mapping can be changed is via explicit user action.
> > 
> > ie:
> > 
> >    gpu_write_something(va, size)
> >    mmap(.., va, size, MAP_FIXED);
> >    gpu_wait_done()
> > 
> > This is racy and indeterminate with both models.
> > 
> > Based on the comment in i915 it appears to be following the model that
> > changes to the mmap by userspace while the GPU is working on it are a
> > programming bug. This is reasonable, lots of systems use this kind of
> > consistency model.
> 
> Well, a userspace process doing munmap(), mremap(), fork() and things
> like that is a bug from the point of view of the i915 kernel/userspace
> contract.
> 
> But things like migration or reclaim are not covered under that
> contract, and for those the expectation is that CPU access to the same
> virtual address should return what was last written to it, either by
> the GPU or the CPU.

Okay, this is a more reasonable point - I agree the i915 registration
cache model precludes using migration and thus DEVICE_PRIVATE. This is
a strong motivation to use the hmm approach.

But we started out this conversation asking if i915 is correct, and I
still say a registration cache model is a functionally correct way to
use notifiers.

> Because of the reference on the page the i915 driver can forego the mmu
> notifier end callback. The thing here is that taking a page reference
> is pointless if we have better synchronization and tracking of mmu
> notifiers. Hence converting to hmm mirror makes it possible to avoid
> taking a ref on the page while still keeping the same functionality as
> today.

However, there is a huge trade-off here. Drivers like this are going
to have very complicated locking inside invalidate_range_start as
they must sleep waiting for dma buffer references to go to zero.

> GPU drivers have complex usage patterns and the tlb shootdown is
> implicit: once the GEM object associated with the uptr is invalidated,
> the next time userspace submits a command against that GEM object it
> will have to be re-validated, which means re-programming the GPU page
> table to point to the proper addresses (and re-calling GUP).

I think it is a mistake to try and cram the very different approaches
to notifiers into the same API. SW ref counting of DMA buffers vs HW
async page faulting have totally different requirements and locking
schemes.

This explains why AMDGPU gets away with not using the hmm API
properly; it is probably relying on its DMA refcount, not the hmm
valid, to keep things in order?

I think the API approach in hmm_mirror is reasonable for page faulting
HW, but does not serve refcounting HW well at all.

Jason
Michal Hocko Aug. 15, 2019, 7:05 p.m. UTC | #32
On Thu 15-08-19 15:24:48, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 07:42:07PM +0200, Michal Hocko wrote:
> > On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote:
> > > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> > > 
> > > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > > > __GFP_DIRECT_RECLAIM..
> > > > >
> > > > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > > > allocations during OOM, then I think fs_reclaim already matches what
> > > > > you described?
> > > > 
> > > > No GFP_NOFS is equally bad. Please read my other email explaining what
> > > > the oom_reaper actually requires. In short no blocking on direct or
> > > > indirect dependency on memory allocation that might sleep.
> > > 
> > > It is much easier to follow with some hints on code, so the true
> > > requirement is that the OOM reaper not block on GFP_FS and GFP_IO
> > > allocations, great, that constraint is now clear.
> > 
> > I still do not get why you put FS/IO into the picture. This is really
> > about __GFP_DIRECT_RECLAIM.
> 
> Like I said this is complicated, translating "no blocking on direct or
> indirect dependency on memory allocation that might sleep" into GFP
> flags is hard for us outside the mm community.
> 
> So the constraint here is no __GFP_DIRECT_RECLAIM?

OK, I am obviously failing to explain that. Sorry about that. You are
right that this is not simple. Let me try again.

The context we are calling !blockable notifiers from has to finish in a
_finite_ amount of time (and swiftness is hugely appreciated by users
of an otherwise non-responsive system that is under OOM). We are out
of memory so we cannot be blocked waiting for memory, directly or
indirectly (via a lock, or waiting for an event that needs memory to
finish, in general). So you need to track dependencies over more
complicated contexts than the direct call path (think of a workqueue
for example).

> I bring up FS/IO because that is what Tejun mentioned when I asked him
> about reclaim restrictions, and is what fs_reclaim_acquire() is
> already sensitive to. It is pretty confusing if we have places using
> the word 'reclaim' with different restrictions. :(

fs_reclaim has been invented to catch potential deadlocks when a
GFP_NO{FS/IO} allocation recurses into fs/io reclaim. This is a subset of
the reclaim that we have. The oom context is more restricted and it
cannot depend on _any_ memory reclaim because by the time we have got to
this context all the reclaim has already failed and waiting some more will
simply not help.

> > >        CPU0                                 CPU1
> > >                                         mutex_lock()
> > >                                         kmalloc(GFP_KERNEL)
> > 
> > no I mean __GFP_DIRECT_RECLAIM here.
> > 
> > >                                         mutex_unlock()
> > >   fs_reclaim_acquire()
> > >   mutex_lock() <- illegal: lock dep assertion
> > 
> > I cannot really comment on how that is achievable by lockdep.
> 
> ??? I am trying to explain this is already done and working today. The
> above example will already fault with lockdep enabled.
> 
> This is existing debugging we can use and improve upon rather than
> inventing new debugging.

This is what you claim and I am saying that fs_reclaim is about a
restricted reclaim context and it is an ugly hack. It has proven to
report false positives. Maybe it can be extended to a generic reclaim.
I haven't tried that and do not aim to try it.
Jason Gunthorpe Aug. 15, 2019, 7:18 p.m. UTC | #33
On Thu, Aug 15, 2019 at 09:05:25PM +0200, Michal Hocko wrote:

> This is what you claim and I am saying that fs_reclaim is about a
> restricted reclaim context and it is an ugly hack. It has proven to
> report false positives. Maybe it can be extended to a generic reclaim.
> I haven't tried that and do not aim to try it.

Okay, great, I think this has been very helpful, at least for me,
thanks. I did not know fs_reclaim was so problematic, or the special
cases about OOM 'reclaim'. 

On this patch, I have no general objection to enforcing drivers to be
non-blocking, I'd just like to see it done with the existing lockdep
can't sleep detection rather than inventing some new debugging for it.

I understand this means the debugging requires lockdep enabled and
will not run in production, but I'm of the view that is OK and in line
with general kernel practice.

The last detail is I'm still unclear what GFP flags a blockable
invalidate_range_start() should use. Is GFP_KERNEL OK? Lockdep has
complained about that in the past due to fs_reclaim - how do you know if it
is a false positive?

Thanks again,
Jason
Michal Hocko Aug. 15, 2019, 7:35 p.m. UTC | #34
On Thu 15-08-19 16:18:10, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 09:05:25PM +0200, Michal Hocko wrote:
> 
> > This is what you claim and I am saying that fs_reclaim is about a
> > restricted reclaim context and it is an ugly hack. It has proven to
> > report false positives. Maybe it can be extended to a generic reclaim.
> > I haven't tried that and do not aim to try it.
> 
> Okay, great, I think this has been very helpful, at least for me,
> thanks. I did not know fs_reclaim was so problematic, or the special
> cases about OOM 'reclaim'. 

I am happy that this is more clear now.

> On this patch, I have no general objection to enforcing drivers to be
> non-blocking, I'd just like to see it done with the existing lockdep
> can't sleep detection rather than inventing some new debugging for it.
> 
> I understand this means the debugging requires lockdep enabled and
> will not run in production, but I'm of the view that is OK and in line
> with general kernel practice.

Yes and I do agree with this in general.

> The last detail is I'm still unclear what GFP flags a blockable
> invalidate_range_start() should use. Is GFP_KERNEL OK?

I hope I will not make this muddy again ;)
invalidate_range_start in the blockable mode can use/depend on any sleepable
allocation allowed in the context it is called from. So in other words
it is no different from any other function in the kernel that calls into
the allocator. As the API is missing a gfp context, I hope it is not
called from any restricted contexts (except from the oom path, for
which we have !blockable).

> Lockdep has
> complained about that in the past due to fs_reclaim - how do you know if it
> is a false positive?

I would have to see the specific lockdep splat.
Jason Gunthorpe Aug. 15, 2019, 8:13 p.m. UTC | #35
On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote:

> > The last detail is I'm still unclear what GFP flags a blockable
> > invalidate_range_start() should use. Is GFP_KERNEL OK?
> 
> I hope I will not make this muddy again ;)
> invalidate_range_start in the blockable mode can use/depend on any sleepable
> allocation allowed in the context it is called from. 

'in the context it is called from' is the magic phrase, as
invalidate_range_start is called while holding several different mm
related locks. I know at least write mmap_sem and i_mmap_rwsem
(write?)

Can GFP_KERNEL be called while holding those locks?

This is the question of indirect dependency on reclaim via locks you
raised earlier.

> So in other words it is no different from any other function in the
> kernel that calls into the allocator. As the API is missing a gfp
> context, I hope it is not called from any restricted contexts (except
> from the oom path, for which we have !blockable).

Yes, the callers are exactly my concern.
 
> > Lockdep has
> > complained about that in the past due to fs_reclaim - how do you know if it
> > is a false positive?
> 
> I would have to see the specific lockdep splat.

See below. I found it when trying to understand why the registration
of the mmu notifier was so oddly coded.

The situation was:

  down_write(&mm->mmap_sem);
  mm_take_all_locks(mm);
  kmalloc(GFP_KERNEL);  <--- lockdep warning

I understood Daniel said he saw this directly on a recent kernel when
working with his lockdep patch?

Checking myself, on todays kernel I see a call chain:

shrink_all_memory
  fs_reclaim_acquire(sc.gfp_mask);
  [..]
  do_try_to_free_pages
   shrink_zones
    shrink_node
     shrink_node_memcg
      shrink_list
       shrink_active_list
        page_referenced
         rmap_walk
          rmap_walk_file
           i_mmap_lock_read
            down_read(i_mmap_rwsem)

So it is possible that the down_read() above will block on
i_mmap_rwsem being held in the caller of invalidate_range_start which
is doing kmalloc(GFP_KERNEL).

Is this OK? The lockdep annotation says no..

Jason

commit 35cfa2b0b491c37e23527822bf365610dbb188e5
Author: Gavin Shan <shangw@linux.vnet.ibm.com>
Date:   Thu Oct 25 13:38:01 2012 -0700

    mm/mmu_notifier: allocate mmu_notifier in advance
    
    While allocating mmu_notifier with parameter GFP_KERNEL, swap would start
    to work in case of tight available memory.  Eventually, that would lead to
    a deadlock while the swap deamon swaps anonymous pages.  It was caused by
    commit e0f3c3f78da29b ("mm/mmu_notifier: init notifier if necessary").
    
      =================================
      [ INFO: inconsistent lock state ]
      3.7.0-rc1+ #518 Not tainted
      ---------------------------------
      inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
      kswapd0/35 [HC0[0]:SC0[0]:HE1:SE1] takes:
       (&mapping->i_mmap_mutex){+.+.?.}, at: page_referenced+0x9c/0x2e0
      {RECLAIM_FS-ON-W} state was registered at:
         mark_held_locks+0x86/0x150
         lockdep_trace_alloc+0x67/0xc0
         kmem_cache_alloc_trace+0x33/0x230
         do_mmu_notifier_register+0x87/0x180
         mmu_notifier_register+0x13/0x20
         kvm_dev_ioctl+0x428/0x510
         do_vfs_ioctl+0x98/0x570
         sys_ioctl+0x91/0xb0
         system_call_fastpath+0x16/0x1b
Daniel Vetter Aug. 15, 2019, 8:16 p.m. UTC | #36
On Thu, Aug 15, 2019 at 9:35 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 15-08-19 16:18:10, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 09:05:25PM +0200, Michal Hocko wrote:
> >
> > > This is what you claim and I am saying that fs_reclaim is about a
> > > restricted reclaim context and it is an ugly hack. It has proven to
> > > report false positives. Maybe it can be extended to a generic reclaim.
> > > I haven't tried that and do not aim to try it.
> >
> > Okay, great, I think this has been very helpful, at least for me,
> > thanks. I did not know fs_reclaim was so problematic, or the special
> > cases about OOM 'reclaim'.
>
> I am happy that this is more clear now.
>
> > On this patch, I have no general objection to enforcing drivers to be
> > non-blocking, I'd just like to see it done with the existing lockdep
> > can't sleep detection rather than inventing some new debugging for it.
> >
> > I understand this means the debugging requires lockdep enabled and
> > will not run in production, but I'm of the view that is OK and in line
> > with general kernel practice.
>
> Yes and I do agree with this in general.

So if someone can explain to me how that works with lockdep I can of
course implement it. But afaics that doesn't exist (I tried to explain
that somewhere else already), and I'm not really looking forward to
also hacking on lockdep for this little series.

> > The last detail is I'm still unclear what GFP flags a blockable
> > invalidate_range_start() should use. Is GFP_KERNEL OK?
>
> I hope I will not make this muddy again ;)
> invalidate_range_start in the blockable mode can use/depend on any sleepable
> allocation allowed in the context it is called from. So in other words
> it is no different from any other function in the kernel that calls into
> the allocator. As the API is missing a gfp context, I hope it is not
> called from any restricted contexts (except from the oom path, for
> which we have !blockable).

Hm, that's new to me. I thought mmu notifiers very much can be called
from direct reclaim paths, so you have to be extremely careful with
getting back into that one. At least the lockdep splats I remember
also tend to have fs_reclaim in there; that's where all the fun comes
from.

> > Lockdep has
> > complained about that in the past due to fs_reclaim - how do you know if it
> > is a false positive?
>
> I would have to see the specific lockdep splat.

I guess the lockdep annotation for invalidate_range_start carries the
same risks as the fs_reclaim annotation. Still feels worth it.
-Daniel
Jason Gunthorpe Aug. 15, 2019, 8:27 p.m. UTC | #37
On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote:

> So if someone can explain to me how that works with lockdep I can of
> course implement it. But afaics that doesn't exist (I tried to explain
> that somewhere else already), and I'm not really looking forward to
> also hacking on lockdep for this little series.

Hmm, kind of looks like it is done by calling preempt_disable()

Probably the debug option is CONFIG_DEBUG_PREEMPT, not lockdep?

Jason
Daniel Vetter Aug. 15, 2019, 8:49 p.m. UTC | #38
On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote:
> > So if someone can explain to me how that works with lockdep I can of
> > course implement it. But afaics that doesn't exist (I tried to explain
> > that somewhere else already), and I'm not really looking forward to
> > also hacking on lockdep for this little series.
>
> Hmm, kind of looks like it is done by calling preempt_disable()

Yup. That was v1, then came the suggestion that disabling preemption
is maybe not the best thing (the oom reaper could still run for a long
time comparatively, if it's cleaning out gigabytes of process memory
or what not, hence this dedicated debug infrastructure).

> Probably the debug option is CONFIG_DEBUG_PREEMPT, not lockdep?

CONFIG_DEBUG_ATOMIC_SLEEP. Like in my patch.
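
Ie the call site then looks roughly like this (sketch of how the series
wires it up):

    if (!mmu_notifier_range_blockable(range))
            non_block_start();
    ret = ops->invalidate_range_start(mn, range);
    if (!mmu_notifier_range_blockable(range))
            non_block_end();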
-Daniel
Andrew Morton Aug. 15, 2019, 10:15 p.m. UTC | #39
On Thu, 15 Aug 2019 10:44:29 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> > I continue to struggle with this.  It introduces a new kernel state
> > "running preemptibly but must not call schedule()".  How does this make
> > any sense?
> > 
> > Perhaps a much, much more detailed description of the oom_reaper
> > situation would help out.
>  
> The primary point here is that there is a demand for non-blockable mmu
> notifiers to be called when the oom reaper tears down the address space.
> As the oom reaper is the primary guarantee of the oom handling forward
> progress it cannot be blocked on anything that might depend on blockable
> memory allocations. These are not really easy to track because they
> might be indirect - e.g. a notifier blocks on a lock which another context
> holds while allocating memory or waiting for a flusher that needs memory
> to perform its work. If such a blocking state happens, we can end up
> in a silent hang with an unusable machine.
> Now we hope for reasonable implementations of mmu notifiers (strong
> words I know ;) and this should be a relatively simple and effective
> catch-all tool to detect that something suspicious is going on.
> 
> Does that make the situation more clear?

Yes, thanks, much.  Maybe a code comment along the lines of

  This is on behalf of the oom reaper, specifically when it is
  calling the mmu notifiers.  The problem is that if the notifier were
  to block on, for example, mutex_lock() and if the process which holds
  that mutex were to perform a sleeping memory allocation, the oom
  reaper is now blocked on completion of that memory allocation.

btw, do we need task_struct.non_block_count?  Perhaps the oom reaper
thread could set a new PF_NONBLOCK (which would be more general than
PF_OOM_REAPER).  If we run out of PF_ flags, use (current == oom_reaper_th).
Jason Gunthorpe Aug. 16, 2019, 1 a.m. UTC | #40
On Thu, Aug 15, 2019 at 10:49:31PM +0200, Daniel Vetter wrote:
> On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote:
> > > So if someone can explain to me how that works with lockdep I can of
> > > course implement it. But afaics that doesn't exist (I tried to explain
> > > that somewhere else already), and I'm not really looking forward to
> > > also hacking on lockdep for this little series.
> >
> > Hmm, kind of looks like it is done by calling preempt_disable()
> 
> Yup. That was v1, then came the suggestion that disabling preemption
> is maybe not the best thing (the oom reaper could still run for a long
> time comparatively, if it's cleaning out gigabytes of process memory
> or what not, hence this dedicated debug infrastructure).

Oh, I'm coming in late, sorry

Anyhow, I was thinking since we agreed this can trigger on some
CONFIG_DEBUG flag, something like

    /* This is a sleepable region, but use preempt_disable to get debugging
     * for calls that are not allowed to block for OOM [.. insert
     * Michal's explanation.. ] */
    if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP) && !mmu_notifier_range_blockable(range))
	preempt_disable();
    ops->invalidate_range_start();

And I have also been idly mulling doing something like

   if (IS_ENABLED(CONFIG_DEBUG_NOTIFIERS) && 
       rand &&
       mmu_notifier_range_blockable(range)) {
     range->flags = 0
     if (!ops->invalidate_range_start(range))
	continue

     // Failed, try again as blockable
     range->flags = MMU_NOTIFIER_RANGE_BLOCKABLE
   }
   ops->invalidate_range_start(range);

Which would give coverage for this corner case without forcing OOM.

Jason
Daniel Vetter Aug. 16, 2019, 6:20 a.m. UTC | #41
On Fri, Aug 16, 2019 at 3:00 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Thu, Aug 15, 2019 at 10:49:31PM +0200, Daniel Vetter wrote:
> > On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote:
> > > > So if someone can explain to me how that works with lockdep I can of
> > > > course implement it. But afaics that doesn't exist (I tried to explain
> > > > that somewhere else already), and I'm not really looking forward to
> > > > also hacking on lockdep for this little series.
> > >
> > > Hmm, kind of looks like it is done by calling preempt_disable()
> >
> > Yup. That was v1, then came the suggestion that disabling preemption
> > is maybe not the best thing (the oom reaper could still run for a long
> > time comparatively, if it's cleaning out gigabytes of process memory
> > or what not, hence this dedicated debug infrastructure).
>
> Oh, I'm coming in late, sorry
>
> Anyhow, I was thinking since we agreed this can trigger on some
> CONFIG_DEBUG flag, something like
>
>     /* This is a sleepable region, but use preempt_disable to get debugging
>      * for calls that are not allowed to block for OOM [.. insert
>      * Michal's explanation.. ] */
>     if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP) && !mmu_notifier_range_blockable(range))
>         preempt_disable();
>     ops->invalidate_range_start();

I think we also discussed that, and some expressed concerns it would
change behaviour/timing too much for testing. Since this does
disable preemption for real, not just for might_sleep.

> And I have also been idly mulling doing something like
>
>    if (IS_ENABLED(CONFIG_DEBUG_NOTIFIERS) &&
>        rand &&
>        mmu_notifier_range_blockable(range)) {
>      range->flags = 0
>      if (!ops->invalidate_range_start(range))
>         continue
>
>      // Failed, try again as blockable
>      range->flags = MMU_NOTIFIER_RANGE_BLOCKABLE
>    }
>    ops->invalidate_range_start(range);
>
> Which would give coverage for this corner case without forcing OOM.

Hm, this sounds like a neat idea to slap on top. The rand is going to
be a bit tricky though, but I guess for this we could stuff another
counter into task_struct and just e.g. do this every 1000th or so
invalidate (we'll need to pick a prime so we cycle through notifiers in
case there are multiple). I like.
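
Roughly like this, where every name is invented just to make the idea
concrete:

    /* sketch: in the notifier loop of __mmu_notifier_invalidate_range_start() */
    if (IS_ENABLED(CONFIG_DEBUG_NOTIFIERS) &&
        mmu_notifier_range_blockable(range) &&
        (current->invalidate_count++ % 1009) == 0) {    /* 1009 is prime */
            range->flags &= ~MMU_NOTIFIER_RANGE_BLOCKABLE;
            if (!mn->ops->invalidate_range_start(mn, range))
                    continue;       /* nonblocking attempt succeeded */
            /* failed, retry as blockable */
            range->flags |= MMU_NOTIFIER_RANGE_BLOCKABLE;
    }
    mn->ops->invalidate_range_start(mn, range);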

Michal, thoughts?
-Daniel
Michal Hocko Aug. 16, 2019, 8:10 a.m. UTC | #42
On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote:
> 
> > > The last detail is I'm still unclear what GFP flags a blockable
> > > invalidate_range_start() should use. Is GFP_KERNEL OK?
> > 
> > I hope I will not make this muddy again ;)
> > invalidate_range_start in the blockable mode can use/depend on any sleepable
> > allocation allowed in the context it is called from. 
> 
> 'in the context it is called from' is the magic phrase, as
> invalidate_range_start is called while holding several different mm
> related locks. I know at least write mmap_sem and i_mmap_rwsem
> (write?)
> 
> Can GFP_KERNEL be called while holding those locks?

i_mmap_rwsem would be problematic because it is taken during the
reclaim.

> This is the question of indirect dependency on reclaim via locks you
> raised earlier.
> 
> > So in other words it is no different from any other function in the
> > kernel that calls into allocator. As the API is missing gfp context
> > then I hope it is not called from any restricted contexts (except
> > from the oom which we have !blockable for).
> 
> Yes, the callers are exactly my concern.
>  
> > > Lockdep has
> > > complained about that in the past due to fs_reclaim - how do you know if it
> > > is a false positive?
> > 
> > I would have to see the specific lockdep splat.
> 
> See below. I found it when trying to understand why the registration
> of the mmu notifier was so oddly coded.
> 
> The situation was:
> 
>   down_write(&mm->mmap_sem);
>   mm_take_all_locks(mm);
>   kmalloc(GFP_KERNEL);  <--- lockdep warning

Ugh. mm_take_all_locks :/

> I understood Daniel said he saw this directly on a recent kernel when
> working with his lockdep patch?
> 
> Checking myself, on today's kernel I see a call chain:
> 
> shrink_all_memory
>   fs_reclaim_acquire(sc.gfp_mask);
>   [..]
>   do_try_to_free_pages
>    shrink_zones
>     shrink_node
>      shrink_node_memcg
>       shrink_list
>        shrink_active_list
>         page_referenced
>          rmap_walk
>           rmap_walk_file
>            i_mmap_lock_read
>             down_read(i_mmap_rwsem)
> 
> So it is possible that the down_read() above will block on
> i_mmap_rwsem being held in the caller of invalidate_range_start which
> is doing kmalloc(GFP_KERNEL).
> 
> Is this OK? The lockdep annotation says no..

It's not, as per the above code path, which is easily possible because
mm_take_all_locks will lock all file vmas.
Michal Hocko Aug. 16, 2019, 8:24 a.m. UTC | #43
On Thu 15-08-19 15:15:09, Andrew Morton wrote:
> On Thu, 15 Aug 2019 10:44:29 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > > I continue to struggle with this.  It introduces a new kernel state
> > > "running preemptibly but must not call schedule()".  How does this make
> > > any sense?
> > > 
> > > Perhaps a much, much more detailed description of the oom_reaper
> > > situation would help out.
> >  
> > The primary point here is that there is a demand for non-blockable mmu
> > notifiers to be called when the oom reaper tears down the address space.
> > As the oom reaper is the primary guarantee of the oom handling forward
> > progress it cannot be blocked on anything that might depend on blockable
> > memory allocations. These are not really easy to track because they
> > might be indirect - e.g. notifier blocks on a lock which other context
> > holds while allocating memory or waiting for a flusher that needs memory
> > to perform its work. If such a blocking state happens then we can end up
> > in a silent hang with an unusable machine.
> > Now we hope for reasonable implementations of mmu notifiers (strong
> > words I know ;) and this should be a relatively simple and effective
> > catch-all tool to detect that something suspicious is going on.
> > 
> > Does that make the situation more clear?
> 
> Yes, thanks, much.  Maybe a code comment along the lines of
> 
>   This is on behalf of the oom reaper, specifically when it is
>   calling the mmu notifiers.  The problem is that if the notifier were
>   to block on, for example, mutex_lock() and if the process which holds
>   that mutex were to perform a sleeping memory allocation, the oom
>   reaper is now blocked on completion of that memory allocation.

    reaper is now blocked on completion of that memory allocation
    and cannot provide the guarantee of the OOM forward progress.

OK. 
 
> btw, do we need task_struct.non_block_count?  Perhaps the oom reaper
> thread could set a new PF_NONBLOCK (which would be more general than
> PF_OOM_REAPER).  If we run out of PF_ flags, use (current == oom_reaper_th).

Well, I do not have a strong opinion here. A simple check for the value
seems to be trivial. There are quite a few holes in task_struct where
this counter could hide without increasing the size.
Michal Hocko Aug. 16, 2019, 8:27 a.m. UTC | #44
On Thu 15-08-19 22:16:43, Daniel Vetter wrote:
> On Thu, Aug 15, 2019 at 9:35 PM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > > The last detail is I'm still unclear what GFP flags a blockable
> > > invalidate_range_start() should use. Is GFP_KERNEL OK?
> >
> > I hope I will not make this muddy again ;)
> > invalidate_range_start in the blockable mode can use/depend on any sleepable
> > allocation allowed in the context it is called from. So in other words
> > it is no different from any other function in the kernel that calls into
> > allocator. As the API is missing gfp context then I hope it is not
> > called from any restricted contexts (except from the oom which we have
> > !blockable for).
> 
> Hm, that's new to me. I thought mmu notifiers very much can be called
> from direct reclaim paths, so you have to be extremely careful with
> getting back into that one.

Correct, I should have added that notifier callbacks ideally do not
allocate any memory. They can block and even that is quite a pain to be
honest.
Jason Gunthorpe Aug. 16, 2019, 12:12 p.m. UTC | #45
On Fri, Aug 16, 2019 at 08:20:55AM +0200, Daniel Vetter wrote:
> On Fri, Aug 16, 2019 at 3:00 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Thu, Aug 15, 2019 at 10:49:31PM +0200, Daniel Vetter wrote:
> > > On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote:
> > > > > So if someone can explain to me how that works with lockdep I can of
> > > > > course implement it. But afaics that doesn't exist (I tried to explain
> > > > > that somewhere else already), and I'm not really looking forward to
> > > > > also hacking on lockdep for this little series.
> > > >
> > > > Hmm, kind of looks like it is done by calling preempt_disable()
> > >
> > > Yup. That was v1, then came the suggestion that disabling preemption
> > > is maybe not the best thing (the oom reaper could still run for a long
> > > time comparatively, if it's cleaning out gigabytes of process memory
> > > or what not, hence this dedicated debug infrastructure).
> >
> > Oh, I'm coming in late, sorry
> >
> > Anyhow, I was thinking since we agreed this can trigger on some
> > CONFIG_DEBUG flag, something like
> >
> >     /* This is a sleepable region, but use preempt_disable to get debugging
> >      * for calls that are not allowed to block for OOM [.. insert
> >      * Michal's explanation.. ] */
> >     if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP) && !mmu_notifier_range_blockable(range))
> >         preempt_disable();
> >     ops->invalidate_range_start();
> 
> I think we also discussed that, and some expressed concerns it would
> change behaviour/timing too much for testing. Since this does
> disable preemption for real, not just for might_sleep.

I don't follow, this is a debug kernel, it will have wildly different
timing. 

Further the point of this debugging on atomic_sleep is to be as
timing-independent as possible since functions with rare sleeps should
be guarded by might_sleep() in their common paths.
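
I.e. the usual pattern (sketch, need_refill() is made up):

    void *frob(size_t sz)
    {
            void *buf = NULL;

            might_sleep();  /* annotate the common path ... */
            if (unlikely(need_refill()))
                    /* ... even though the sleep itself is rare */
                    buf = kmalloc(sz, GFP_KERNEL);
            return buf;
    }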

I guess I don't get the push to have some low overhead debugging for
this? Is there something special you are looking for?

Jason
Jason Gunthorpe Aug. 16, 2019, 12:19 p.m. UTC | #46
On Fri, Aug 16, 2019 at 10:10:29AM +0200, Michal Hocko wrote:
> On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote:
> > 
> > > > The last detail is I'm still unclear what GFP flags a blockable
> > > > invalidate_range_start() should use. Is GFP_KERNEL OK?
> > > 
> > > I hope I will not make this muddy again ;)
> > > invalidate_range_start in the blockable mode can use/depend on any sleepable
> > > allocation allowed in the context it is called from. 
> > 
> > 'in the context it is called from' is the magic phrase, as
> > invalidate_range_start is called while holding several different mm
> > related locks. I know at least write mmap_sem and i_mmap_rwsem
> > (write?)
> > 
> > Can GFP_KERNEL be called while holding those locks?
> 
> i_mmap_rwsem would be problematic because it is taken during the
> reclaim.

Okay.. So the fs_reclaim debugging does catch errors. Do you have any
reference for what a false positive looks like? 

I would like to inject it into the notifier path as this is very
difficult for driver authors to discover and know about, but I'm
worried about your false positive remark.

I think I understand we can use only GFP_ATOMIC in the notifiers, but
we need a strategy to handle OOM to guarantee forward progress.

This is just more bugs to fix :(

Jason
Michal Hocko Aug. 16, 2019, 12:26 p.m. UTC | #47
On Fri 16-08-19 09:19:06, Jason Gunthorpe wrote:
> On Fri, Aug 16, 2019 at 10:10:29AM +0200, Michal Hocko wrote:
> > On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote:
> > > On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote:
> > > 
> > > > > The last detail is I'm still unclear what GFP flags a blockable
> > > > > invalidate_range_start() should use. Is GFP_KERNEL OK?
> > > > 
> > > > I hope I will not make this muddy again ;)
> > > > invalidate_range_start in the blockable mode can use/depend on any sleepable
> > > > allocation allowed in the context it is called from. 
> > > 
> > > 'in the context it is called from' is the magic phrase, as
> > > invalidate_range_start is called while holding several different mm
> > > related locks. I know at least write mmap_sem and i_mmap_rwsem
> > > (write?)
> > > 
> > > Can GFP_KERNEL be called while holding those locks?
> > 
> > i_mmap_rwsem would be problematic because it is taken during the
> > reclaim.
> 
> Okay.. So the fs_reclaim debugging does catch errors.

I do not think fs_reclaim is the underlying mechanism to catch this
deadlock. It is a simple AA deadlock: you take i_mmap_rwsem, then go
down the allocation path into direct reclaim and take the lock again.
Nothing really surprising. fs_reclaim is really there to catch a GFP_NOFS
context calling into a less restricted (e.g. GFP_KERNEL) allocation
context.
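
Spelled out, the AA case is just (illustrative, not an actual splat):

    down_write(&mapping->i_mmap_rwsem);     /* e.g. via mm_take_all_locks() */
    kmalloc(size, GFP_KERNEL)
      -> direct reclaim
        -> rmap_walk_file()
          -> down_read(&mapping->i_mmap_rwsem)    /* on the rwsem we hold */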

> Do you have any
> reference for what a false positive looks like? 

I believe I have given some examples when introducing __GFP_NOLOCKDEP.
 
> I would like to inject it into the notifier path as this is very
> difficult for driver authors to discover and know about, but I'm
> worried about your false positive remark.
> 
> I think I understand we can use only GFP_ATOMIC in the notifiers, but
> we need a strategy to handle OOM to guarantee forward progress.

Your example is from the notifier registration IIUC. Can you
pre-allocate before taking locks? Could you point me to some examples
where an allocation is necessary in the range notifier callback?
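
I.e. something like this for the registration path (a sketch, the list
field is made up):

    /* allocate up front, only link under the locks */
    mn = kzalloc(sizeof(*mn), GFP_KERNEL);
    if (!mn)
            return -ENOMEM;
    down_write(&mm->mmap_sem);
    mm_take_all_locks(mm);
    hlist_add_head(&mn->hlist, &mm->notifier_list);     /* no allocation here */
    mm_drop_all_locks(mm);
    up_write(&mm->mmap_sem);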
Daniel Vetter Aug. 16, 2019, 2:11 p.m. UTC | #48
On Fri, Aug 16, 2019 at 2:12 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Aug 16, 2019 at 08:20:55AM +0200, Daniel Vetter wrote:
> > On Fri, Aug 16, 2019 at 3:00 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > On Thu, Aug 15, 2019 at 10:49:31PM +0200, Daniel Vetter wrote:
> > > > On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > > On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote:
> > > > > > So if someone can explain to me how that works with lockdep I can of
> > > > > > course implement it. But afaics that doesn't exist (I tried to explain
> > > > > > that somewhere else already), and I'm not really looking forward to
> > > > > > also hacking on lockdep for this little series.
> > > > >
> > > > > Hmm, kind of looks like it is done by calling preempt_disable()
> > > >
> > > > Yup. That was v1, then came the suggestion that disabling preemption
> > > > is maybe not the best thing (the oom reaper could still run for a long
> > > > time comparatively, if it's cleaning out gigabytes of process memory
> > > > or what not, hence this dedicated debug infrastructure).
> > >
> > > Oh, I'm coming in late, sorry
> > >
> > > Anyhow, I was thinking since we agreed this can trigger on some
> > > CONFIG_DEBUG flag, something like
> > >
> > >     /* This is a sleepable region, but use preempt_disable to get debugging
> > >      * for calls that are not allowed to block for OOM [.. insert
> > >      * Michal's explanation.. ] */
> > >     if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP) && !mmu_notifier_range_blockable(range))
> > >         preempt_disable();
> > >     ops->invalidate_range_start();
> >
> > I think we also discussed that, and some expressed concerns it would
> > change behaviour/timing too much for testing. Since this does
> > disable preemption for real, not just for might_sleep.
>
> I don't follow, this is a debug kernel, it will have wildly different
> timing.
>
> Further the point of this debugging on atomic_sleep is to be as
> timing-independent as possible since functions with rare sleeps should
> be guarded by might_sleep() in their common paths.
>
> I guess I don't get the push to have some low overhead debugging for
> this? Is there something special you are looking for?

Don't ask me, I'm just trying to get _some_ debugging for this in. I
don't care one bit how much overhead it has because in our CI our
debug build has lockdep and everything and it crawls anyway. I started
out with the preempt_disable/enable thing like you suggested, and a
few rounds later we're here. We can go back to v1 on this one, but I'd
prefer to not do the lap too often.

Also, aside from this patch (which is prep for the next) and some
simple reordering conflicts they're all independent. So if there's no
way to paint this bikeshed here (technicolor perhaps?) then I'd like
to get at least the others considered.
-Daniel
Jason Gunthorpe Aug. 16, 2019, 2:31 p.m. UTC | #49
On Fri, Aug 16, 2019 at 02:26:25PM +0200, Michal Hocko wrote:
> On Fri 16-08-19 09:19:06, Jason Gunthorpe wrote:
> > On Fri, Aug 16, 2019 at 10:10:29AM +0200, Michal Hocko wrote:
> > > On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote:
> > > > On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote:
> > > > 
> > > > > > The last detail is I'm still unclear what GFP flags a blockable
> > > > > > invalidate_range_start() should use. Is GFP_KERNEL OK?
> > > > > 
> > > > > I hope I will not make this muddy again ;)
> > > > > invalidate_range_start in the blockable mode can use/depend on any sleepable
> > > > > allocation allowed in the context it is called from. 
> > > > 
> > > > 'in the context it is called from' is the magic phrase, as
> > > > invalidate_range_start is called while holding several different mm
> > > > related locks. I know at least write mmap_sem and i_mmap_rwsem
> > > > (write?)
> > > > 
> > > > Can GFP_KERNEL be called while holding those locks?
> > > 
> > > i_mmap_rwsem would be problematic because it is taken during the
> > > reclaim.
> > 
> > Okay.. So the fs_reclaim debugging does catch errors.
> 
> I do not think fs_reclaim is the underlying mechanism to catch this
> deadlock. 

Indeed lockdep would catch it directly as an AA deadlock, but only if
we happen to take the reclaim path under the kmalloc in the notifier
and luck into it also choosing to lock the same lock we are holding.

The trouble is this is very difficult to hit.

lockdep allows making this less difficult. For instance with a
'might_reclaim()' annotation in the allocator we could check that the
various reclaim related locks are obtainable. Testing doesn't need to
get lucky and go down the very unlikely path.

It turns out this is already happening, it is actually a side effect
of the way fs_reclaim works now.
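
(Specifically, IIUC every allocation attempt effectively does

    fs_reclaim_acquire(gfp_mask);
    fs_reclaim_release(gfp_mask);

early on, which tells lockdep "this allocation may enter reclaim"
without having to actually go down that unlikely path - pretty much the
might_reclaim() semantics above.)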

> > Do you have any
> > reference for what a false positive looks like? 
> 
> I believe I have given some examples when introducing __GFP_NOLOCKDEP.

Okay, I think that is 7e7844226f10 ("lockdep: allow to disable reclaim
lockup detection") Hmm, sadly the lkml link in the commit is broken.

Hum. There are no users of __GFP_NOLOCKDEP today?? Could all the false
positives have been fixed??

Keep in mind that this fs_reclaim was reworked away from the hacky
interrupt context flag to a fairly elegant real lock map.

Based on my read of the commit message, my first reaction would be to
suggest lockdep_set_class() to solve the problem described, i.e.
different classes for 'inside transaction' and 'outside transaction'
on xfs_refcountbt_init_cursor().
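
Very roughly, with entirely made-up lock and key names since I have not
studied the xfs code:

    static struct lock_class_key inside_tx_key, outside_tx_key;

    lockdep_set_class(&cur->bc_lock,
                      tp ? &inside_tx_key : &outside_tx_key);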

I understood that generally that is the way to handle lockdep false
positives.

Anyhow, if you are willing to consider that lockdep isn't broken, I
have some ideas on how to make this clearer and increase
coverage. Would you be willing to look at patches on this topic? (not
soon, I have to finish my mmu notifier stuff)

> > I would like to inject it into the notifier path as this is very
> > difficult for driver authors to discover and know about, but I'm
> > worried about your false positive remark.
> > 
> > I think I understand we can use only GFP_ATOMIC in the notifiers, but
> > we need a strategy to handle OOM to guarantee forward progress.
> 
> Your example is from the notifier registration IIUC. 

Yes, that is where this commit hit it.. Triggering this under an
actual notifier to get a lockdep report is hard.

> Can you pre-allocate before taking locks? Could you point me to some
> examples where an allocation is necessary in the range notifier
> callback?

Hmm. I took a careful look, I only found mlx5 as obviously allocating
memory:

 mlx5_ib_invalidate_range()
  mlx5_ib_update_xlt()
   __get_free_pages(gfp, get_order(size));

However, I think this could be changed to fall back to some small
buffer if allocation fails. The existing scheme looks sketchy.
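
e.g. something like this (a sketch only, the fallback buffer and the
chunking are invented, I have not tried it):

    buf = (void *)__get_free_pages(gfp | __GFP_NOWARN, get_order(size));
    if (!buf) {
            buf = mr->fallback_xlt;         /* small preallocated buffer */
            size = FALLBACK_XLT_SIZE;       /* and update in smaller chunks */
    }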

nouveau does:

 nouveau_svmm_invalidate
  nvif_object_mthd
   kmalloc(GFP_KERNEL)

But I think it reliably uses a stack buffer here

i915 I think Daniel said he audited.

amd_mn.. The actual invalidate_range_start does not allocate memory,
but it is entangled with so many locks it would need careful analysis
to be sure.

The others look generally OK, which is good, better than I hoped :)

Jason
Jason Gunthorpe Aug. 16, 2019, 2:38 p.m. UTC | #50
On Fri, Aug 16, 2019 at 04:11:34PM +0200, Daniel Vetter wrote:
> Also, aside from this patch (which is prep for the next) and some
> simple reordering conflicts they're all independent. So if there's no
> way to paint this bikeshed here (technicolor perhaps?) then I'd like
> to get at least the others considered.

Sure, I think for conflict avoidance reasons I'm probably taking
mmu_notifier stuff via hmm.git, so:

- Andrew had a minor remark on #1, I am ambivalent and would take it
  as-is. Your decision if you want to respin.
- #2/#3 is this issue, I would stand by the preempt_disable/etc path
  Our situation matches yours, debug tests run lockdep/etc.
- #4 I like a lot, except the map should enclose range_end too,
  this can be done after the mm_has_notifiers inside the
  __mmu_notifier function
  Can you respin?
  I will propose preloading the map in another patch
- #5 is already applied in -rc

Jason
Jerome Glisse Aug. 16, 2019, 3:05 p.m. UTC | #51
On Fri, Aug 16, 2019 at 11:31:45AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 16, 2019 at 02:26:25PM +0200, Michal Hocko wrote:
> > On Fri 16-08-19 09:19:06, Jason Gunthorpe wrote:
> > > On Fri, Aug 16, 2019 at 10:10:29AM +0200, Michal Hocko wrote:
> > > > On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote:
> > > > > On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote:

[...]

> > > I would like to inject it into the notifier path as this is very
> > > difficult for driver authors to discover and know about, but I'm
> > > worried about your false positive remark.
> > > 
> > > I think I understand we can use only GFP_ATOMIC in the notifiers, but
> > > we need a strategy to handle OOM to guarantee forward progress.
> > 
> > Your example is from the notifier registration IIUC. 
> 
> Yes, that is where this commit hit it.. Triggering this under an
> actual notifier to get a lockdep report is hard.
> 
> > Can you pre-allocate before taking locks? Could you point me to some
> > examples where an allocation is necessary in the range notifier
> > callback?
> 
> Hmm. I took a careful look, I only found mlx5 as obviously allocating
> memory:
> 
>  mlx5_ib_invalidate_range()
>   mlx5_ib_update_xlt()
>    __get_free_pages(gfp, get_order(size));
> 
> However, I think this could be changed to fall back to some small
> buffer if allocation fails. The existing scheme looks sketchy
> 
> nouveau does:
> 
>  nouveau_svmm_invalidate
>   nvif_object_mthd
>    kmalloc(GFP_KERNEL)
> 
> But I think it reliably uses a stack buffer here
> 
> i915 I think Daniel said he audited.
> 
> amd_mn.. The actual invalidate_range_start does not allocate memory,
> but it is entangled with so many locks it would need careful analysis
> to be sure.
> 
> The others look generally OK, which is good, better than I hoped :)

It is on my TODO list to get rid of allocations in notifier callbacks
(iirc nouveau already uses the stack, unless that was lost in all the
revisions it went through). Anyway I do not think we need allocations
in notifiers.

Cheers,
Jérôme
Daniel Vetter Aug. 16, 2019, 4:36 p.m. UTC | #52
On Fri, Aug 16, 2019 at 4:38 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Aug 16, 2019 at 04:11:34PM +0200, Daniel Vetter wrote:
> > Also, aside from this patch (which is prep for the next) and some
> > simple reordering conflicts they're all independent. So if there's no
> > way to paint this bikeshed here (technicolor perhaps?) then I'd like
> > to get at least the others considered.
>
> Sure, I think for conflict avoidance reasons I'm probably taking
> mmu_notifier stuff via hmm.git, so:
>
> - Andrew had a minor remark on #1, I am ambivalent and would take it
>   as-is. Your decision if you want to respin.

I like mine better, see also the reply from Ralph Campbell.

> - #2/#3 is this issue, I would stand by the preempt_disable/etc path
>   Our situation matches yours, debug tests run lockdep/etc.

Since Michal requested the current flavour I think we need to spin a bit
more on these here. I guess I'll just rebase them to the end so
they're not holding up the others.

> - #4 I like a lot, except the map should enclose range_end too,
>   this can be done after the mm_has_notifiers inside the
>   __mmu_notifier function

To make sure I get this right: The same lockdep context, but also
wrapped around invalidate_range_end? From my understanding of pte
zapping that makes sense, but I'm definitely not well-versed enough
for that.

>   Can you respin?

Will do.

>   I will propose preloading the map in another patch
> - #5 is already applied in -rc

Yup, I'll drop that one.

Thanks, Daniel
Jason Gunthorpe Aug. 16, 2019, 4:54 p.m. UTC | #53
On Fri, Aug 16, 2019 at 06:36:52PM +0200, Daniel Vetter wrote:
> On Fri, Aug 16, 2019 at 4:38 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Fri, Aug 16, 2019 at 04:11:34PM +0200, Daniel Vetter wrote:
> > > Also, aside from this patch (which is prep for the next) and some
> > > simple reordering conflicts they're all independent. So if there's no
> > > way to paint this bikeshed here (technicolor perhaps?) then I'd like
> > > to get at least the others considered.
> >
> > Sure, I think for conflict avoidance reasons I'm probably taking
> > mmu_notifier stuff via hmm.git, so:
> >
> > - Andrew had a minor remark on #1, I am ambivalent and would take it
> >   as-is. Your decision if you want to respin.
> 
> I like mine better, see also the reply from Ralph Campbell.

Sure

> > - #2/#3 is this issue, I would stand by the preempt_disable/etc path
> >   Our situation matches yours, debug tests run lockdep/etc.
>
> Since Michal requested the current flavour I think we need to spin a bit
> more on these here. I guess I'll just rebase them to the end so
> they're not holding up the others.
> 
> > - #4 I like a lot, except the map should enclose range_end too,
> >   this can be done after the mm_has_notifiers inside the
> >   __mmu_notifier function
> 
> To make sure I get this right: The same lockdep context, but also
> wrapped around invalidate_range_end? 

Yes, the locking context of _range_start and _range_end should be
identical, last time I checked the callers this was the case.

So, just add it to __mmu_notifier_invalidate_range_end() outside the
SRCU as there is no reason to burden debug kernel callers twice when
mmu notifiers are not enabled.
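
I.e. roughly this, where the map name is just my guess at what #4 calls
it:

    void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range,
                                             bool only_end)
    {
            int id;

            lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
            lock_map_release(&__mmu_notifier_invalidate_range_start_map);

            id = srcu_read_lock(&srcu);
            ...
    }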

Jason
Michal Hocko Aug. 20, 2019, 8:18 a.m. UTC | #54
On Fri 16-08-19 11:31:45, Jason Gunthorpe wrote:
> On Fri, Aug 16, 2019 at 02:26:25PM +0200, Michal Hocko wrote:
[...]
> > I believe I have given some examples when introducing __GFP_NOLOCKDEP.
> 
> Okay, I think that is 7e7844226f10 ("lockdep: allow to disable reclaim
> lockup detection") Hmm, sadly the lkml link in the commit is broken.
> 
> Hum. There are no users of __GFP_NOLOCKDEP today?? Could all the false
> positives have been fixed??

I would be more than surprised. Those false positives were usually
papered over by using GFP_NOFS. And the original plan was to convert
those back to GFP_KERNEL-like allocations and use __GFP_NOLOCKDEP.
 
> Keep in mind that this fs_reclaim was reworked away from the hacky
> interrupt context flag to a fairly elegant real lock map.

I am glad to hear that because that was just too ugly to live.

> Based on my read of the commit message, my first reaction would be to
> suggest lockdep_set_class() to solve the problem described, i.e.
> different classes for 'inside transaction' and 'outside transaction'
> on xfs_refcountbt_init_cursor().

No, this just turned out to be unmaintainable. The number of lock
classes kept growing. I recommend one of Dave Chinner's rants on the
topic. Sorry, no link handy.

> I understood that generally that is the way to handle lockdep false
> positives.
> 
> Anyhow, if you are willing to consider that lockdep isn't broken, I
> have some ideas on how to make this clearer and increase
> coverage. Would you be willing to look at patches on this topic? (not
> soon, I have to finish my mmu notifier stuff)

I haven't claimed that lockdep is broken. It just had problems
capturing some code paths and generated false positives. I would recommend
talking to lockdep maintainers much more than to me because I would have
to dive into the code much more to be useful. I can still comment on the
MM side of the thing of course if that is helpful.

> > > I would like to inject it into the notifier path as this is very
> > > difficult for driver authors to discover and know about, but I'm
> > > worried about your false positive remark.
> > > 
> > > I think I understand we can use only GFP_ATOMIC in the notifiers, but
> > > we need a strategy to handle OOM to guarentee forward progress.
> > 
> > Your example is from the notifier registration IIUC. 
> 
> Yes, that is where this commit hit it.. Triggering this under an
> actual notifier to get a lockdep report is hard.

All you need is to generate memory pressure. That shouldn't be that
hard.
diff mbox series

Patch

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 4fa360a13c1e..915fd9888afb 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -217,7 +217,9 @@  extern void __cant_sleep(const char *file, int line, int preempt_offset);
  * might_sleep - annotation for functions that can sleep
  *
  * this macro will print a stack trace if it is executed in an atomic
- * context (spinlock, irq-handler, ...).
+ * context (spinlock, irq-handler, ...). Additional sections where blocking is
+ * not allowed can be annotated with non_block_start() and non_block_end()
+ * pairs.
  *
  * This is a useful debugging help to be able to catch problems early and not
  * be bitten later when the calling function happens to sleep when it is not
@@ -233,6 +235,10 @@  extern void __cant_sleep(const char *file, int line, int preempt_offset);
 # define cant_sleep() \
 	do { __cant_sleep(__FILE__, __LINE__, 0); } while (0)
 # define sched_annotate_sleep()	(current->task_state_change = 0)
+# define non_block_start() \
+	do { current->non_block_count++; } while (0)
+# define non_block_end() \
+	do { WARN_ON(current->non_block_count-- == 0); } while (0)
 #else
   static inline void ___might_sleep(const char *file, int line,
 				   int preempt_offset) { }
@@ -241,6 +247,8 @@  extern void __cant_sleep(const char *file, int line, int preempt_offset);
 # define might_sleep() do { might_resched(); } while (0)
 # define cant_sleep() do { } while (0)
 # define sched_annotate_sleep() do { } while (0)
+# define non_block_start() do { } while (0)
+# define non_block_end() do { } while (0)
 #endif
 
 #define might_sleep_if(cond) do { if (cond) might_sleep(); } while (0)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9f51932bd543..c5630f3dca1f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -974,6 +974,10 @@  struct task_struct {
 	struct mutex_waiter		*blocked_on;
 #endif
 
+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+	int				non_block_count;
+#endif
+
 #ifdef CONFIG_TRACE_IRQFLAGS
 	unsigned int			irq_events;
 	unsigned long			hardirq_enable_ip;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b037f195473..57245770d6cc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3700,13 +3700,22 @@  static noinline void __schedule_bug(struct task_struct *prev)
 /*
  * Various schedule()-time debugging checks and statistics:
  */
-static inline void schedule_debug(struct task_struct *prev)
+static inline void schedule_debug(struct task_struct *prev, bool preempt)
 {
 #ifdef CONFIG_SCHED_STACK_END_CHECK
 	if (task_stack_end_corrupted(prev))
 		panic("corrupted stack end detected inside scheduler\n");
 #endif
 
+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+	if (!preempt && prev->state && prev->non_block_count) {
+		printk(KERN_ERR "BUG: scheduling in a non-blocking section: %s/%d/%i\n",
+			prev->comm, prev->pid, prev->non_block_count);
+		dump_stack();
+		add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
+	}
+#endif
+
 	if (unlikely(in_atomic_preempt_off())) {
 		__schedule_bug(prev);
 		preempt_count_set(PREEMPT_DISABLED);
@@ -3813,7 +3822,7 @@  static void __sched notrace __schedule(bool preempt)
 	rq = cpu_rq(cpu);
 	prev = rq->curr;
 
-	schedule_debug(prev);
+	schedule_debug(prev, preempt);
 
 	if (sched_feat(HRTICK))
 		hrtick_clear(rq);
@@ -6570,7 +6579,7 @@  void ___might_sleep(const char *file, int line, int preempt_offset)
 	rcu_sleep_check();
 
 	if ((preempt_count_equals(preempt_offset) && !irqs_disabled() &&
-	     !is_idle_task(current)) ||
+	     !is_idle_task(current) && !current->non_block_count) ||
 	    system_state == SYSTEM_BOOTING || system_state > SYSTEM_RUNNING ||
 	    oops_in_progress)
 		return;
@@ -6586,8 +6595,8 @@  void ___might_sleep(const char *file, int line, int preempt_offset)
 		"BUG: sleeping function called from invalid context at %s:%d\n",
 			file, line);
 	printk(KERN_ERR
-		"in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n",
-			in_atomic(), irqs_disabled(),
+		"in_atomic(): %d, irqs_disabled(): %d, non_block: %d, pid: %d, name: %s\n",
+			in_atomic(), irqs_disabled(), current->non_block_count,
 			current->pid, current->comm);
 
 	if (task_stack_end_corrupted(current))