
[2/5] kernel.h: Add non_block_start/end()

Message ID 20190814202027.18735-3-daniel.vetter@ffwll.ch (mailing list archive)
State New, archived
Series hmm & mmu_notifier debug/lockdep annotations

Commit Message

Daniel Vetter Aug. 14, 2019, 8:20 p.m. UTC
In some special cases we must not block, but there's not a
spinlock, preempt-off, irqs-off or similar critical section already
that arms the might_sleep() debug checks. Add a non_block_start/end()
pair to annotate these.

This will be used in the oom paths of mmu-notifiers, where blocking is
not allowed to make sure there's forward progress. Quoting Michal:

"The notifier is called from quite a restricted context - oom_reaper -
which shouldn't depend on any locks or sleepable conditionals. The code
should be swift as well but we mostly do care about it to make a forward
progress. Checking for sleepable context is the best thing we could come
up with that would describe these demands at least partially."

Peter also asked whether we want to catch spinlocks on top, but Michal
said those are less of a problem because spinlocks can't have an
indirect dependency upon the page allocator and hence close the loop
with the oom reaper.

Suggested by Michal Hocko.

v2:
- Improve commit message (Michal)
- Also check in schedule, not just might_sleep (Peter)

v3: It works better when I actually squash in the fixup I had lying
around :-/

Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: linux-mm@kvack.org
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Wei Wang <wvw@google.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-kernel@vger.kernel.org
Acked-by: Christian König <christian.koenig@amd.com> (v1)
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 include/linux/kernel.h | 10 +++++++++-
 include/linux/sched.h  |  4 ++++
 kernel/sched/core.c    | 19 ++++++++++++++-----
 3 files changed, 27 insertions(+), 6 deletions(-)
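
The mechanism itself is tiny. A minimal sketch of the idea (not the exact
diff; kconfig guards, kerneldoc and the might_sleep()/scheduler plumbing
are omitted, so treat the details below as illustrative):

  /* include/linux/kernel.h */
  #define non_block_start() (current->non_block_count++)
  #define non_block_end()   (WARN_ON(current->non_block_count-- == 0))

  /* include/linux/sched.h: new debug field in struct task_struct */
  int non_block_count;

with kernel/sched/core.c extending the ___might_sleep() checks (and, per
the v2 note above, schedule() itself) to warn whenever the counter is
non-zero.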

Comments

Andrew Morton Aug. 14, 2019, 8:45 p.m. UTC | #1
On Wed, 14 Aug 2019 22:20:24 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote:

> In some special cases we must not block, but there's not a
> spinlock, preempt-off, irqs-off or similar critical section already
> that arms the might_sleep() debug checks. Add a non_block_start/end()
> pair to annotate these.
> 
> This will be used in the oom paths of mmu-notifiers, where blocking is
> not allowed to make sure there's forward progress. Quoting Michal:
> 
> "The notifier is called from quite a restricted context - oom_reaper -
> which shouldn't depend on any locks or sleepable conditionals. The code
> should be swift as well but we mostly do care about it to make a forward
> progress. Checking for sleepable context is the best thing we could come
> up with that would describe these demands at least partially."
> 
> Peter also asked whether we want to catch spinlocks on top, but Michal
> said those are less of a problem because spinlocks can't have an
> indirect dependency upon the page allocator and hence close the loop
> with the oom reaper.

I continue to struggle with this.  It introduces a new kernel state
"running preemptibly but must not call schedule()".  How does this make
any sense?

Perhaps a much, much more detailed description of the oom_reaper
situation would help out.
Daniel Vetter Aug. 15, 2019, 6:52 a.m. UTC | #2
On Wed, Aug 14, 2019 at 01:45:58PM -0700, Andrew Morton wrote:
> On Wed, 14 Aug 2019 22:20:24 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> 
> > In some special cases we must not block, but there's not a
> > spinlock, preempt-off, irqs-off or similar critical section already
> > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > pair to annotate these.
> > 
> > This will be used in the oom paths of mmu-notifiers, where blocking is
> > not allowed to make sure there's forward progress. Quoting Michal:
> > 
> > "The notifier is called from quite a restricted context - oom_reaper -
> > which shouldn't depend on any locks or sleepable conditionals. The code
> > should be swift as well but we mostly do care about it to make a forward
> > progress. Checking for sleepable context is the best thing we could come
> > up with that would describe these demands at least partially."
> > 
> > Peter also asked whether we want to catch spinlocks on top, but Michal
> > said those are less of a problem because spinlocks can't have an
> > indirect dependency upon the page allocator and hence close the loop
> > with the oom reaper.
> 
> I continue to struggle with this.  It introduces a new kernel state
> "running preemptibly but must not call schedule()".  How does this make
> any sense?
> 
> Perhaps a much, much more detailed description of the oom_reaper
> situation would help out.

I agree on both points, but I guess I'm not the expert to explain why we
have this. All I'm trying to do is make sure that drivers hold up their
side. If you want better documentation for why the oom case needs this
special new mode, you're looking at the wrong guy for that :-)

Of course if you folks all decide that you just don't want to be
reminded about that, I guess we can drop this one here, but I think
you're just shooting the messenger.
-Daniel
Daniel Vetter Aug. 15, 2019, 6:58 a.m. UTC | #3
On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote:
> > In some special cases we must not block, but there's not a
> > spinlock, preempt-off, irqs-off or similar critical section already
> > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > pair to annotate these.
> > 
> > This will be used in the oom paths of mmu-notifiers, where blocking is
> > not allowed to make sure there's forward progress. Quoting Michal:
> > 
> > "The notifier is called from quite a restricted context - oom_reaper -
> > which shouldn't depend on any locks or sleepable conditionals. The code
> > should be swift as well but we mostly do care about it to make a forward
> > progress. Checking for sleepable context is the best thing we could come
> > up with that would describe these demands at least partially."
> 
> But this describes fs_reclaim_acquire() - is there some reason we are
> conflating fs_reclaim with non-sleeping?

No idea why you tie this into fs_reclaim. We can definitely sleep in
there, and for e.g. kswapd (which also wraps everything in fs_reclaim)
we're even supposed to, I thought, to make sure we can get at the last
bit of memory by flushing all the queues and waiting for everything to
be cleaned out.

> ie is there some fundamental difference between the block stack
> sleeping during reclaim while it waits for a driver to write out a
> page and a GPU driver sleeping during OOM while it waits for it's HW
> to fence DMA on a page?
> 
> Fundamentally we have invalidate_range_start() vs invalidate_range()
> as the start() version is able to sleep. If drivers can do their work
> without sleeping then they should be using invalidate_range() instead.
> 
> Thus, it doesn't seem to make any sense to ask a driver that requires a
> sleeping API not to sleep.
> 
> AFAICT what is really going on here is that drivers care about only a
> subset of the VA space, and we want to query the driver if it cares
> about the range proposed to be OOM'd, so we can OOM ranges that do
> not have SPTEs.
> 
> ie if you look pretty much all drivers do exactly as
> userptr_mn_invalidate_range_start() does, and bail once they detect
> the VA range is of interest.
> 
> So, I'm working on a patch to lift the interval tree into the notifier
> core and then do the VA test OOM needs without bothering the
> driver. Drivers can retain the blocking API they require and OOM can
> work on VA's that don't have SPTEs.

Hm I figured the point of hmm_mirror is to have that interval tree for
everyone (among other things). But yeah, lifting it to mmu_notifier sounds
like a clean solution for this, though I really don't have much clue about
why we even have this special mode in the oom case. I'm just trying to
increase the odds that drivers hold up their end of the bargain.

> This approach also solves the critical bug in this path:
>   https://lore.kernel.org/linux-mm/20190807191627.GA3008@ziepe.ca/
> 
> And solves a bunch of other bugs in the drivers.
> 
> > Peter also asked whether we want to catch spinlocks on top, but Michal
> > said those are less of a problem because spinlocks can't have an
> > indirect dependency upon the page allocator and hence close the loop
> > with the oom reaper.
> 
> Again, this entirely sounds like fs_reclaim - isn't that exactly what
> it is for?
> 
> I have had on my list a second and very related possible bug. I ran
> into commit 35cfa2b0b491 ("mm/mmu_notifier: allocate mmu_notifier in
> advance") which says that mapping->i_mmap_mutex is under fs_reclaim().
> 
> We do hold i_mmap_rwsem while calling invalidate_range_start():
> 
>  unmap_mapping_pages
>   i_mmap_lock_write(mapping); // ie i_mmap_rwsem
>   unmap_mapping_range_tree
>    unmap_mapping_range_vma
>     zap_page_range_single
>      mmu_notifier_invalidate_range_start
> 
> So, if it is still true that i_mmap_rwsem is under fs_reclaim then
> invalidate_range_start is *always* under fs_reclaim anyhow! (this I do
> not know)
> 
> Thus we should use lockdep to force this and fix all the drivers.
> 
> .. and if we force fs_reclaim always, do we care about blockable
> anymore??

Still not sure why fs_reclaim matters here. One of the later patches
steals the fs_reclaim idea for mmu_notifiers, but that ties together
completely different paths.
-Daniel
Michal Hocko Aug. 15, 2019, 8:44 a.m. UTC | #4
On Wed 14-08-19 13:45:58, Andrew Morton wrote:
> On Wed, 14 Aug 2019 22:20:24 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> 
> > In some special cases we must not block, but there's not a
> > spinlock, preempt-off, irqs-off or similar critical section already
> > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > pair to annotate these.
> > 
> > This will be used in the oom paths of mmu-notifiers, where blocking is
> > not allowed to make sure there's forward progress. Quoting Michal:
> > 
> > "The notifier is called from quite a restricted context - oom_reaper -
> > which shouldn't depend on any locks or sleepable conditionals. The code
> > should be swift as well but we mostly do care about it to make a forward
> > progress. Checking for sleepable context is the best thing we could come
> > up with that would describe these demands at least partially."
> > 
> > Peter also asked whether we want to catch spinlocks on top, but Michal
> > said those are less of a problem because spinlocks can't have an
> > indirect dependency upon the page allocator and hence close the loop
> > with the oom reaper.
> 
> I continue to struggle with this.  It introduces a new kernel state
> "running preemptibly but must not call schedule()".  How does this make
> any sense?
> 
> Perhaps a much, much more detailed description of the oom_reaper
> situation would help out.
 
The primary point here is that there is a demand for non-blockable mmu
notifiers to be called when the oom reaper tears down the address space.
As the oom reaper is the primary guarantee of oom handling forward
progress, it cannot be blocked on anything that might depend on blockable
memory allocations. These are not really easy to track because they
might be indirect - e.g. a notifier blocks on a lock which another context
holds while allocating memory or waiting for a flusher that needs memory
to perform its work. If such a blocking state happens, we can end up
in a silent hang with an unusable machine.
Now we hope for reasonable implementations of mmu notifiers (strong
words, I know ;)) and this should be a relatively simple and effective
catch-all tool to detect that something suspicious is going on.
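
To make the intended use concrete, the annotation is meant to be armed by
the mmu notifier core around the !blockable invalidation, presumably along
these lines in a later patch of the series (sketch only, error handling
and details elided):

  if (!mmu_notifier_range_blockable(range))
          non_block_start();
  ret = __mmu_notifier_invalidate_range_start(range);
  if (!mmu_notifier_range_blockable(range))
          non_block_end();

Any might_sleep() hit by a notifier while the counter is elevated then
produces a warning.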

Does that make the situation more clear?
Daniel Vetter Aug. 15, 2019, 1:12 p.m. UTC | #5
On Thu, Aug 15, 2019 at 3:04 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote:
>
> > As the oom reaper is the primary guarantee of the oom handling forward
> > progress it cannot be blocked on anything that might depend on blockable
> > memory allocations. These are not really easy to track because they
> > might be indirect - e.g. notifier blocks on a lock which other context
> > holds while allocating memory or waiting for a flusher that needs memory
> > to perform its work.
>
> But lockdep *does* track all this and fs_reclaim_acquire() was created
> to solve exactly this problem.
>
> fs_reclaim is a lock and it flows through all the usual lockdep
> schemes like any other lock. Any time the page allocator wants to do
> something the would deadlock with reclaim it takes the lock.
>
> Failure is expressed by a deadlock cycle in the lockdep map, and
> lockdep can handle arbitary complexity through layers of locks, work
> queues, threads, etc.
>
> What is missing?

Lockdep doesn't see everything by far. E.g. a wait_event will be
caught by the annotations here, but not by lockdep. Plus lockdep does
not see through the wait_event, and doesn't realize if e.g. that event
will never signal because the worker is part of the deadlock loop.
cross-release was supposed to fix that, but it seems like that will
never land.

And since we're talking about mmu notifiers here and gpus/dma engines:
we have dma_fence_wait, which can wait for any hw/driver in the system
that takes part in shared/zero-copy buffer processing. Which at least
on the graphics side is everything. This pulls in enormous amounts of
deadlock potential that lockdep simply is blind about and will never
see.

Arming might_sleep catches them all.
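
As a rough illustration (made-up names, just to show the class of sleep
that lockdep has no edge for, but which the counter-based check trips
over):

  static void example_wait_for_hw(struct wait_queue_head *wq, bool *done)
  {
          non_block_start();
          /*
           * wait_event() starts with a might_sleep() check, so with the
           * non-block counter elevated this warns even if *done already
           * happens to be true and we never actually schedule.
           */
          wait_event(*wq, *done);
          non_block_end();
  }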

Cheers, Daniel

PS: Don't ask me why we need these semantics for oom_reaper; like I
said, I'm just trying to follow the rules.
Michal Hocko Aug. 15, 2019, 1:21 p.m. UTC | #6
On Thu 15-08-19 09:23:44, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 08:58:29AM +0200, Daniel Vetter wrote:
> > On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote:
> > > > In some special cases we must not block, but there's not a
> > > > spinlock, preempt-off, irqs-off or similar critical section already
> > > > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > > > pair to annotate these.
> > > > 
> > > > This will be used in the oom paths of mmu-notifiers, where blocking is
> > > > not allowed to make sure there's forward progress. Quoting Michal:
> > > > 
> > > > "The notifier is called from quite a restricted context - oom_reaper -
> > > > which shouldn't depend on any locks or sleepable conditionals. The code
> > > > should be swift as well but we mostly do care about it to make a forward
> > > > progress. Checking for sleepable context is the best thing we could come
> > > > up with that would describe these demands at least partially."
> > > 
> > > But this describes fs_reclaim_acquire() - is there some reason we are
> > > conflating fs_reclaim with non-sleeping?
> > 
> > No idea why you tie this into fs_reclaim. We can definitly sleep in there,
> > and for e.g. kswapd (which also wraps everything in fs_reclaim) we're
> > event supposed to I thought. To make sure we can get at the last bit of
> > memory by flushing all the queues and waiting for everything to be cleaned
> > out.
> 
> AFAIK the point of fs_reclaim is to prevent "indirect dependency upon
> the page allocator" ie a justification that was given this !blockable
> stuff.
> 
> For instance:
> 
>   fs_reclaim_acquire()
>   kmalloc(GFP_KERNEL) <- lock dep assertion
> 
> And further, Michal's concern about indirectness through locks is also
> handled by lockdep:
> 
>        CPU0                                 CPU1
>                                         mutex_lock()
>                                         kmalloc(GFP_KERNEL)
>                                         mutex_unlock()
>   fs_reclaim_acquire()
>   mutex_lock() <- lock dep assertion
> 
> In other words, to prevent recursion into the page allocator you use
> fs_reclaim_acquire(), and lockdep verifies it in its usual robust way.

fs_reclaim_acquire is about FS/IO recursions IIUC. We are talking about
any !GFP_NOWAIT allocation context here and any {in}direct dependency on
it. Whether fs_reclaim_acquire can be reused for that I do not know,
because I am not familiar enough with the lockdep machinery.
 
> I asked Tejun about this once in regards to WQ_MEM_RECLAIM and he
> explained that it means you can't call the allocator functions in a
> way that would recurse into reclaim (ie instead use instead GFP_ATOMIC, or
> tolerate allocation failure, or various other things).
> 
> So, the reason I bring it up is half the justifications you posted for
> blockable had to do with not recursing into reclaim and deadlocking,
> and didn't seem to have much to do with blocking.
> 
> I'm asking if *non-blocking* is really the requirement or if this is
> just the usual 'do not deadlock on the allocator' thing reclaim paths
> alread have?

No, non-blocking is a very coarse approximation of what we really need.
But it should give us an even stronger condition. Essentially any sleep
other than a preemption shouldn't be allowed in that context.
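
Which is, as far as I understand the scheduler side, also why the check is
wired into schedule() itself and not just might_sleep(): an involuntary
preemption goes through __schedule() with preempt == true and is left
alone, while a voluntary context switch gets flagged. Roughly (a sketch
along the lines of the patch, not the exact hunk):

  /* kernel/sched/core.c, in the __schedule(bool preempt) debug checks */
  #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
          if (unlikely(!preempt && prev->non_block_count)) {
                  printk(KERN_ERR "BUG: scheduling in a non-blocking section: %s/%d/%i\n",
                         prev->comm, prev->pid, prev->non_block_count);
                  dump_stack();
          }
  #endif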
Michal Hocko Aug. 15, 2019, 1:24 p.m. UTC | #7
On Thu 15-08-19 10:04:15, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote:
> 
> > As the oom reaper is the primary guarantee of the oom handling forward
> > progress it cannot be blocked on anything that might depend on blockable
> > memory allocations. These are not really easy to track because they
> > might be indirect - e.g. notifier blocks on a lock which other context
> > holds while allocating memory or waiting for a flusher that needs memory
> > to perform its work.
> 
> But lockdep *does* track all this and fs_reclaim_acquire() was created
> to solve exactly this problem.
> 
> fs_reclaim is a lock and it flows through all the usual lockdep
> schemes like any other lock. Any time the page allocator wants to do
> something the would deadlock with reclaim it takes the lock.

Our emails have crossed. Even if fs_reclaim can be re-purposed for other
than FS/IO reclaim contexts, which I am not sure about, I think that
lockdep is too heavyweight for the purpose of this debugging aid in the
first place. The non-block mode is really something as simple as "hey
mmu notifier, you are only allowed to do lightweight processing; if you
cannot guarantee that then just back off". The annotation will give us a
warning if the notifier breaks this promise.
Daniel Vetter Aug. 15, 2019, 2:43 p.m. UTC | #8
On Thu, Aug 15, 2019 at 4:38 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Aug 15, 2019 at 03:12:11PM +0200, Daniel Vetter wrote:
> > On Thu, Aug 15, 2019 at 3:04 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote:
> > >
> > > > As the oom reaper is the primary guarantee of the oom handling forward
> > > > progress it cannot be blocked on anything that might depend on blockable
> > > > memory allocations. These are not really easy to track because they
> > > > might be indirect - e.g. notifier blocks on a lock which other context
> > > > holds while allocating memory or waiting for a flusher that needs memory
> > > > to perform its work.
> > >
> > > But lockdep *does* track all this and fs_reclaim_acquire() was created
> > > to solve exactly this problem.
> > >
> > > fs_reclaim is a lock and it flows through all the usual lockdep
> > > schemes like any other lock. Any time the page allocator wants to do
> > > something the would deadlock with reclaim it takes the lock.
> > >
> > > Failure is expressed by a deadlock cycle in the lockdep map, and
> > > lockdep can handle arbitary complexity through layers of locks, work
> > > queues, threads, etc.
> > >
> > > What is missing?
> >
> > Lockdep doens't seen everything by far. E.g. a wait_event will be
> > caught by the annotations here, but not by lockdep.
>
> Sure, but the wait_event might be OK if its progress isn't contingent
> on fs_reclaim, ie triggered from interrupt, so why ban it?

For normal notifiers sure (but lockdep won't help you prove that at
all). For oom_reaper apparently not, but that's really Michal's thing,
I don't have much idea about that.

> > And since we're talking about mmu notifiers here and gpus/dma engines.
> > We have dma_fence_wait, which can wait for any hw/driver in the system
> > that takes part in shared/zero-copy buffer processing. Which at least
> > on the graphics side is everything. This pulls in enormous amounts of
> > deadlock potential that lockdep simply is blind about and will never
> > see.
>
> It seems very risky to entagle a notifier widely like that.

That's why I've looked into all possible ways to annotate them with
debug infrastructure.

> It looks pretty sure that notifiers are fs_reclaim, so at a minimum
> that wait_event can't be contingent on anything that is doing
> GFP_KERNEL or it will deadlock - and blockable doesn't make that sleep
> safe.
>
> Avoiding an uncertain wait_event under notifiers would seem to be the
> only reasonable design here..

You have to wait for the gpu to finish current processing in
invalidate_range_start, otherwise there's no point to any of this
really. So the wait_event/dma_fence_wait are unavoidable.
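
The pattern for a driver's userptr notifier then ends up looking roughly
like this (names are illustrative, not copied from any specific driver):

  struct my_userptr {
          struct mmu_notifier mn;
          unsigned long start, end;
          struct dma_fence *active_fence;
  };

  static int my_invalidate_range_start(struct mmu_notifier *mn,
                          const struct mmu_notifier_range *range)
  {
          struct my_userptr *obj = container_of(mn, struct my_userptr, mn);

          /* cheap VA-range test first, most invalidations aren't ours */
          if (range->end <= obj->start || range->start >= obj->end)
                  return 0;

          /* oom_reaper & friends: we would have to sleep, so back off */
          if (!mmu_notifier_range_blockable(range))
                  return -EAGAIN;

          /* wait for the hw to stop touching the pages being invalidated */
          dma_fence_wait(obj->active_fence, false);
          return 0;
  }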

That's also why I'm throwing in the lockdep annotation on top, and why
it would be really nice if we somehow can get the cross-release work
landed. But it looks like all the people who could make it happen are
busy with smeltdown :-/
-Daniel
Michal Hocko Aug. 15, 2019, 4 p.m. UTC | #9
On Thu 15-08-19 11:12:19, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 03:21:27PM +0200, Michal Hocko wrote:
> > On Thu 15-08-19 09:23:44, Jason Gunthorpe wrote:
> > > On Thu, Aug 15, 2019 at 08:58:29AM +0200, Daniel Vetter wrote:
> > > > On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote:
> > > > > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote:
> > > > > > In some special cases we must not block, but there's not a
> > > > > > spinlock, preempt-off, irqs-off or similar critical section already
> > > > > > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > > > > > pair to annotate these.
> > > > > > 
> > > > > > This will be used in the oom paths of mmu-notifiers, where blocking is
> > > > > > not allowed to make sure there's forward progress. Quoting Michal:
> > > > > > 
> > > > > > "The notifier is called from quite a restricted context - oom_reaper -
> > > > > > which shouldn't depend on any locks or sleepable conditionals. The code
> > > > > > should be swift as well but we mostly do care about it to make a forward
> > > > > > progress. Checking for sleepable context is the best thing we could come
> > > > > > up with that would describe these demands at least partially."
> > > > > 
> > > > > But this describes fs_reclaim_acquire() - is there some reason we are
> > > > > conflating fs_reclaim with non-sleeping?
> > > > 
> > > > No idea why you tie this into fs_reclaim. We can definitly sleep in there,
> > > > and for e.g. kswapd (which also wraps everything in fs_reclaim) we're
> > > > event supposed to I thought. To make sure we can get at the last bit of
> > > > memory by flushing all the queues and waiting for everything to be cleaned
> > > > out.
> > > 
> > > AFAIK the point of fs_reclaim is to prevent "indirect dependency upon
> > > the page allocator" ie a justification that was given this !blockable
> > > stuff.
> > > 
> > > For instance:
> > > 
> > >   fs_reclaim_acquire()
> > >   kmalloc(GFP_KERNEL) <- lock dep assertion
> > > 
> > > And further, Michal's concern about indirectness through locks is also
> > > handled by lockdep:
> > > 
> > >        CPU0                                 CPU1
> > >                                         mutex_lock()
> > >                                         kmalloc(GFP_KERNEL)
> > >                                         mutex_unlock()
> > >   fs_reclaim_acquire()
> > >   mutex_lock() <- lock dep assertion
> > > 
> > > In other words, to prevent recursion into the page allocator you use
> > > fs_reclaim_acquire(), and lockdep verfies it in its usual robust way.
> > 
> > fs_reclaim_acquire is about FS/IO recursions IIUC. We are talking about
> > any !GFP_NOWAIT allocation context here and any {in}direct dependency on
> > it. 
> 
> AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> __GFP_DIRECT_RECLAIM..
>
> This matches the existing test in __need_fs_reclaim() - so if you are
> OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> allocations during OOM, then I think fs_reclaim already matches what
> you described?

No, GFP_NOFS is equally bad. Please read my other email explaining what
the oom_reaper actually requires. In short: no blocking on a direct or
indirect dependency on a memory allocation that might sleep. If you can
express that in the existing lockdep machinery, all fine. But then
consider deployments where lockdep is a no-no because of the overhead.

> > No, non-blocking is a very coarse approximation of what we really need.
> > But it should give us even a stronger condition. Essentially any sleep
> > other than a preemption shouldn't be allowed in that context.
> 
> But it is a nonsense API to give the driver invalidate_range_start,
> the blocking alternative to the non-blocking invalidate_range and then
> demand it to be non-blocking.
> 
> Inspecting the code, no drivers are actually able to progress their
> side in non-blocking mode.
> 
> The best we got was drivers tested the VA range and returned success
> if they had no interest. Which is a big win to be sure, but it looks
> like getting any more is not really posssible.

And that is already a great win! Because many notifiers only care
about particular mappings. Please note that backing off unconditionally
will simply mean that the oom reaper has to back off, not tearing
down anything.

> However, we could (probably even should) make the drivers fs_reclaim
> safe.
> 
> If that is enough to guarantee progress of OOM, then lets consider
> something like using current_gfp_context() to force PF_MEMALLOC_NOFS
> allocation behavior on the driver callback and lockdep to try and keep
> pushing on the the debugging, and dropping !blocking.

How are you going to enforce an indirect dependency? E.g. a lock that is
also used in another context which depends on a sleepable memory
allocation to move forward.

Really, this was aimed at providing a very simple debugging aid. If it
is considered so controversial, even though I really do not see why,
then let's just drop it on the floor.
Daniel Vetter Aug. 15, 2019, 4:25 p.m. UTC | #10
On Thu, Aug 15, 2019 at 5:10 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
>
> > You have to wait for the gpu to finnish current processing in
> > invalidate_range_start. Otherwise there's no point to any of this
> > really. So the wait_event/dma_fence_wait are unavoidable really.
>
> I don't envy your task :|
>
> But, what you describe sure sounds like a 'registration cache' model,
> not the 'shadow pte' model of coherency.
>
> The key difference is that a registration cache is allowed to become
> incoherent with the VMA's because it holds page pins. It is a
> programming bug in userspace to change VA mappings via mmap/munmap/etc
> while the device is working on that VA, but it does not harm system
> integrity because of the page pin.
>
> The cache ensures that each initiated operation sees a DMA setup that
> matches the current VA map when the operation is initiated and allows
> expensive device DMA setups to be re-used.
>
> A 'shadow pte' model (ie hmm) *really* needs device support to
> directly block DMA access - ie trigger 'device page fault'. ie the
> invalidate_start should inform the device to enter a fault mode and
> that is it.  If the device can't do that, then the driver probably
> shouldn't pursue this level of coherency. The driver would quickly get
> into the messy locking problems like dma_fence_wait from a notifier.
>
> It is important to identify what model you are going for as defining a
> 'registration cache' coherence expectation allows the driver to skip
> blocking in invalidate_range_start. All it does is invalidate the
> cache so that future operations pick up the new VA mapping.
>
> Intel's HFI RDMA driver uses this model extensively, and I think it is
> well proven, within some limitations of course.
>
> At least, 'registration cache' is the only use model I know of where
> it is acceptable to skip invalidate_range_end.

I'm not really well versed in the details of our userptr, but both
amdgpu and i915 wait for the gpu to complete from
invalidate_range_start. Jerome has at least looked a lot at the amdgpu
one, so maybe he can explain what exactly it is we're doing ...
-Daniel
Jerome Glisse Aug. 15, 2019, 4:32 p.m. UTC | #11
On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
> 
> > You have to wait for the gpu to finnish current processing in
> > invalidate_range_start. Otherwise there's no point to any of this
> > really. So the wait_event/dma_fence_wait are unavoidable really.
> 
> I don't envy your task :|
> 
> But, what you describe sure sounds like a 'registration cache' model,
> not the 'shadow pte' model of coherency.
> 
> The key difference is that a regirstationcache is allowed to become
> incoherent with the VMA's because it holds page pins. It is a
> programming bug in userspace to change VA mappings via mmap/munmap/etc
> while the device is working on that VA, but it does not harm system
> integrity because of the page pin.
> 
> The cache ensures that each initiated operation sees a DMA setup that
> matches the current VA map when the operation is initiated and allows
> expensive device DMA setups to be re-used.
> 
> A 'shadow pte' model (ie hmm) *really* needs device support to
> directly block DMA access - ie trigger 'device page fault'. ie the
> invalidate_start should inform the device to enter a fault mode and
> that is it.  If the device can't do that, then the driver probably
> shouldn't persue this level of coherency. The driver would quickly get
> into the messy locking problems like dma_fence_wait from a notifier.

I think here we do not agree on the hardware requirement. For GPUs
we will always need to be able to wait for some GPU fence from inside
the notifier callback; there is just no way around that for many of
the GPUs today (I do not see any indication of that changing).

Drivers should avoid lock complexity by using a wait queue so that the
driver notifier callback can wait without having to hold some driver
lock. However, there will be at least one lock needed to update the
internal driver state for the range being invalidated. That lock is
just the device driver's page table lock for the GPU page table
associated with the mm_struct. In all GPU drivers so far it is a
short-lived lock and nothing blocking is done while holding it (it is
really just about updating the page table directory, whether it is
filling it or clearing it).


> 
> It is important to identify what model you are going for as defining a
> 'registration cache' coherence expectation allows the driver to skip
> blocking in invalidate_range_start. All it does is invalidate the
> cache so that future operations pick up the new VA mapping.
> 
> Intel's HFI RDMA driver uses this model extensively, and I think it is
> well proven, within some limitations of course.
> 
> At least, 'registration cache' is the only use model I know of where
> it is acceptable to skip invalidate_range_end.

Here GPUs are not in the registration cache model. I know it might look
like it because of GUP, but GUP was used just because hmm did not exist
at the time.

Cheers,
Jérôme
Jerome Glisse Aug. 15, 2019, 5:11 p.m. UTC | #12
On Thu, Aug 15, 2019 at 01:56:31PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> 
> > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > __GFP_DIRECT_RECLAIM..
> > >
> > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > allocations during OOM, then I think fs_reclaim already matches what
> > > you described?
> > 
> > No GFP_NOFS is equally bad. Please read my other email explaining what
> > the oom_reaper actually requires. In short no blocking on direct or
> > indirect dependecy on memory allocation that might sleep.
> 
> It is much easier to follow with some hints on code, so the true
> requirement is that the OOM reaper not block on GFP_FS and GFP_IO
> allocations, great, that constraint is now clear.
> 
> > If you can express that in the existing lockdep machinery. All
> > fine. But then consider deployments where lockdep is no-no because
> > of the overhead.
> 
> This is all for driver debugging. The point of lockdep is to find all
> these paths without have to hit them as actual races, using debug
> kernels.
> 
> I don't think we need this kind of debugging on production kernels?
> 
> > > The best we got was drivers tested the VA range and returned success
> > > if they had no interest. Which is a big win to be sure, but it looks
> > > like getting any more is not really posssible.
> > 
> > And that is already a great win! Because many notifiers only do care
> > about particular mappings. Please note that backing off unconditioanlly
> > will simply cause that the oom reaper will have to back off not doing
> > any tear down anything.
> 
> Well, I'm working to propose that we do the VA range test under core
> mmu notifier code that cannot block and then we simply remove the idea
> of blockable from drivers using this new 'range notifier'. 
> 
> I think this pretty much solves the concern?

I am not sure I follow what you propose here. Like I pointed out in
another email, for GPU we do need to be able to sleep (we might get
lucky and not need to, but this is a runtime thing) within the notifier
range_start callback. This has been something allowed by the notifier
since it was introduced in the kernel.

Cheers,
Jérôme
Daniel Vetter Aug. 15, 2019, 5:21 p.m. UTC | #13
On Thu, Aug 15, 2019 at 7:16 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Aug 15, 2019 at 12:32:38PM -0400, Jerome Glisse wrote:
> > On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
> > >
> > > > You have to wait for the gpu to finnish current processing in
> > > > invalidate_range_start. Otherwise there's no point to any of this
> > > > really. So the wait_event/dma_fence_wait are unavoidable really.
> > >
> > > I don't envy your task :|
> > >
> > > But, what you describe sure sounds like a 'registration cache' model,
> > > not the 'shadow pte' model of coherency.
> > >
> > > The key difference is that a regirstationcache is allowed to become
> > > incoherent with the VMA's because it holds page pins. It is a
> > > programming bug in userspace to change VA mappings via mmap/munmap/etc
> > > while the device is working on that VA, but it does not harm system
> > > integrity because of the page pin.
> > >
> > > The cache ensures that each initiated operation sees a DMA setup that
> > > matches the current VA map when the operation is initiated and allows
> > > expensive device DMA setups to be re-used.
> > >
> > > A 'shadow pte' model (ie hmm) *really* needs device support to
> > > directly block DMA access - ie trigger 'device page fault'. ie the
> > > invalidate_start should inform the device to enter a fault mode and
> > > that is it.  If the device can't do that, then the driver probably
> > > shouldn't persue this level of coherency. The driver would quickly get
> > > into the messy locking problems like dma_fence_wait from a notifier.
> >
> > I think here we do not agree on the hardware requirement. For GPU
> > we will always need to be able to wait for some GPU fence from inside
> > the notifier callback, there is just no way around that for many of
> > the GPUs today (i do not see any indication of that changing).
>
> I didn't say you couldn't wait, I was trying to say that the wait
> should only be contigent on the HW itself. Ie you can wait on a GPU
> page table lock, and you can wait on a GPU page table flush completion
> via IRQ.
>
> What is troubling is to wait till some other thread gets a GPU command
> completion and decr's a kref on the DMA buffer - which kinda looks
> like what this dma_fence() stuff is all about. A driver like that
> would have to be super careful to ensure consistent forward progress
> toward dma ref == 0 when the system is under reclaim.
>
> ie by running it's entire IRQ flow under fs_reclaim locking.

This is correct. At least for i915 it's already required due to our
shrinker also having to do the same. I think amdgpu isn't bothering
with that since they have vram for most of the stuff, and just limit
system memory usage to half of the total and forgo the shrinker.
Probably not the nicest approach. Anyway, both do the same mmu_notifier
dance; I just want to explain that we've been living with this for
longer already.

So yeah writing a gpu driver is not easy.
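
(To make that concrete: teaching lockdep that a given path must be safe
to run from within reclaim is just a matter of taking the fs_reclaim
pseudo-lock around it once, e.g. at init time; a sketch with made-up
driver names:)

  /* prime lockdep: this completion path must work from within reclaim */
  static void my_driver_prime_reclaim_lockdep(struct my_driver *drv)
  {
          fs_reclaim_acquire(GFP_KERNEL);
          my_driver_fence_completion_work(drv); /* normally runs from IRQ work */
          fs_reclaim_release(GFP_KERNEL);
  }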

> > associated with the mm_struct. In all GPU driver so far it is a short
> > lived lock and nothing blocking is done while holding it (it is just
> > about updating page table directory really wether it is filling it or
> > clearing it).
>
> The main blocking I expect in a shadow PTE flow is waiting for the HW
> to complete invalidations of its PTE cache.
>
> > > It is important to identify what model you are going for as defining a
> > > 'registration cache' coherence expectation allows the driver to skip
> > > blocking in invalidate_range_start. All it does is invalidate the
> > > cache so that future operations pick up the new VA mapping.
> > >
> > > Intel's HFI RDMA driver uses this model extensively, and I think it is
> > > well proven, within some limitations of course.
> > >
> > > At least, 'registration cache' is the only use model I know of where
> > > it is acceptable to skip invalidate_range_end.
> >
> > Here GPU are not in the registration cache model, i know it might looks
> > like it because of GUP but GUP was use just because hmm did not exist
> > at the time.
>
> It is not because of GUP, it is because of the lack of
> invalidate_range_end. A driver cannot correctly implement the SPTE
> model without invalidate_range_end, even if it holds the page pins via
> GUP.
>
> So, I've been assuming the few drivers without invalidate_range_end
> are trying to do registration caching, rather than assuming they are
> broken.

i915 might just be broken. amdgpu does the full thing, using
hmm_mirror, but still with dma_fence_wait.
-Daniel
Jerome Glisse Aug. 15, 2019, 5:35 p.m. UTC | #14
On Thu, Aug 15, 2019 at 07:21:47PM +0200, Daniel Vetter wrote:
> On Thu, Aug 15, 2019 at 7:16 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Thu, Aug 15, 2019 at 12:32:38PM -0400, Jerome Glisse wrote:
> > > On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
> > > >
> > > > > You have to wait for the gpu to finnish current processing in
> > > > > invalidate_range_start. Otherwise there's no point to any of this
> > > > > really. So the wait_event/dma_fence_wait are unavoidable really.
> > > >
> > > > I don't envy your task :|
> > > >
> > > > But, what you describe sure sounds like a 'registration cache' model,
> > > > not the 'shadow pte' model of coherency.
> > > >
> > > > The key difference is that a regirstationcache is allowed to become
> > > > incoherent with the VMA's because it holds page pins. It is a
> > > > programming bug in userspace to change VA mappings via mmap/munmap/etc
> > > > while the device is working on that VA, but it does not harm system
> > > > integrity because of the page pin.
> > > >
> > > > The cache ensures that each initiated operation sees a DMA setup that
> > > > matches the current VA map when the operation is initiated and allows
> > > > expensive device DMA setups to be re-used.
> > > >
> > > > A 'shadow pte' model (ie hmm) *really* needs device support to
> > > > directly block DMA access - ie trigger 'device page fault'. ie the
> > > > invalidate_start should inform the device to enter a fault mode and
> > > > that is it.  If the device can't do that, then the driver probably
> > > > shouldn't persue this level of coherency. The driver would quickly get
> > > > into the messy locking problems like dma_fence_wait from a notifier.
> > >
> > > I think here we do not agree on the hardware requirement. For GPU
> > > we will always need to be able to wait for some GPU fence from inside
> > > the notifier callback, there is just no way around that for many of
> > > the GPUs today (i do not see any indication of that changing).
> >
> > I didn't say you couldn't wait, I was trying to say that the wait
> > should only be contigent on the HW itself. Ie you can wait on a GPU
> > page table lock, and you can wait on a GPU page table flush completion
> > via IRQ.
> >
> > What is troubling is to wait till some other thread gets a GPU command
> > completion and decr's a kref on the DMA buffer - which kinda looks
> > like what this dma_fence() stuff is all about. A driver like that
> > would have to be super careful to ensure consistent forward progress
> > toward dma ref == 0 when the system is under reclaim.
> >
> > ie by running it's entire IRQ flow under fs_reclaim locking.
> 
> This is correct. At least for i915 it's already a required due to our
> shrinker also having to do the same. I think amdgpu isn't bothering
> with that since they have vram for most of the stuff, and just limit
> system memory usage to half of all and forgo the shrinker. Probably
> not the nicest approach. Anyway, both do the same mmu_notifier dance,
> just want to explain that we've been living with this for longer
> already.
> 
> So yeah writing a gpu driver is not easy.
> 
> > > associated with the mm_struct. In all GPU driver so far it is a short
> > > lived lock and nothing blocking is done while holding it (it is just
> > > about updating page table directory really wether it is filling it or
> > > clearing it).
> >
> > The main blocking I expect in a shadow PTE flow is waiting for the HW
> > to complete invalidations of its PTE cache.
> >
> > > > It is important to identify what model you are going for as defining a
> > > > 'registration cache' coherence expectation allows the driver to skip
> > > > blocking in invalidate_range_start. All it does is invalidate the
> > > > cache so that future operations pick up the new VA mapping.
> > > >
> > > > Intel's HFI RDMA driver uses this model extensively, and I think it is
> > > > well proven, within some limitations of course.
> > > >
> > > > At least, 'registration cache' is the only use model I know of where
> > > > it is acceptable to skip invalidate_range_end.
> > >
> > > Here GPU are not in the registration cache model, i know it might looks
> > > like it because of GUP but GUP was use just because hmm did not exist
> > > at the time.
> >
> > It is not because of GUP, it is because of the lack of
> > invalidate_range_end. A driver cannot correctly implement the SPTE
> > model without invalidate_range_end, even if it holds the page pins via
> > GUP.
> >
> > So, I've been assuming the few drivers without invalidate_range_end
> > are trying to do registration caching, rather than assuming they are
> > broken.
> 
> I915 might just be broken. amdgpu does the full thing, using
> hmm_mirror. But still with dma_fence_wait.

Yeah, i915 is broken but it never hurt anyone ;) I posted a patch
a long time ago to convert it to hmm, but I delayed that until
I can get through making something of GUPfast that can also be
used by HMM/ODP users.

Cheers,
Jérôme
Jerome Glisse Aug. 15, 2019, 5:39 p.m. UTC | #15
On Thu, Aug 15, 2019 at 02:35:57PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote:
> 
> > I'm not really well versed in the details of our userptr, but both
> > amdgpu and i915 wait for the gpu to complete from
> > invalidate_range_start. Jerome has at least looked a lot at the amdgpu
> > one, so maybe he can explain what exactly it is we're doing ...
> 
> amdgpu is (wrongly) using hmm for something, I can't really tell what
> it is trying to do. The calls to dma_fence under the
> invalidate_range_start do not give me a good feeling.
> 
> However, i915 shows all the signs of trying to follow the registration
> cache model, it even has a nice comment in
> i915_gem_userptr_get_pages() explaining that the races it has don't
> matter because it is a user space bug to change the VA mapping in the
> first place. That just screams registration cache to me.
> 
> So it is fine to run HW that way, but if you do, there is no reason to
> fence inside the invalidate_range end. Just orphan the DMA buffer and
> clean it up & release the page pins when all DMA buffer refs go to
> zero. The next access to that VA should get a new DMA buffer with the
> right mapping.
> 
> In other words the invalidation should be very simple without
> complicated locking, or wait_event's. Look at hfi1 for example.

This would break the current usage model of userptr and it would break
userspace expectations, ie if the GPU is writing to that memory, then
userspace wants to make sure that it will see what the GPU wrote.

Yes, i915 is broken in respect to not having an end notifier and not
tracking active invalidations for a range, but the GUP side of things
kind of hides this bug and shrinks the window for bad things to happen
to something so small that I doubt anyone could ever hit it (still a
bug though).

Cheers,
Jérôme
Michal Hocko Aug. 15, 2019, 5:42 p.m. UTC | #16
On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> 
> > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > __GFP_DIRECT_RECLAIM..
> > >
> > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > allocations during OOM, then I think fs_reclaim already matches what
> > > you described?
> > 
> > No GFP_NOFS is equally bad. Please read my other email explaining what
> > the oom_reaper actually requires. In short no blocking on direct or
> > indirect dependecy on memory allocation that might sleep.
> 
> It is much easier to follow with some hints on code, so the true
> requirement is that the OOM reaper not block on GFP_FS and GFP_IO
> allocations, great, that constraint is now clear.

I still do not get why you put FS/IO into the picture. This is really
about __GFP_DIRECT_RECLAIM.

> 
> > If you can express that in the existing lockdep machinery. All
> > fine. But then consider deployments where lockdep is no-no because
> > of the overhead.
> 
> This is all for driver debugging. The point of lockdep is to find all
> these paths without have to hit them as actual races, using debug
> kernels.
> 
> I don't think we need this kind of debugging on production kernels?

Again, the primary motivation was a simple debugging aid that could be
used without worrying about overhead. So lockdep is very often out of
the question.

> > > The best we got was drivers tested the VA range and returned success
> > > if they had no interest. Which is a big win to be sure, but it looks
> > > like getting any more is not really posssible.
> > 
> > And that is already a great win! Because many notifiers only do care
> > about particular mappings. Please note that backing off unconditioanlly
> > will simply cause that the oom reaper will have to back off not doing
> > any tear down anything.
> 
> Well, I'm working to propose that we do the VA range test under core
> mmu notifier code that cannot block and then we simply remove the idea
> of blockable from drivers using this new 'range notifier'. 
> 
> I think this pretty much solves the concern?

Well, my idea was that a range check and early bail out was a first step
and then each specific notifier would be able to do a more specific
check. I was not able to do the second step because that requires a deep
understanding of the respective subsystem.

Really, all I care about is reclaiming as much memory from the
oom_reaper context as possible. And that cannot really be an unbounded
process. Quite the contrary, it should be as swift as possible. From my
cursory look some notifiers are able to achieve their task without
blocking or depending on memory just fine. So bailing out
unconditionally on the range of interest would just put us back.

> > > However, we could (probably even should) make the drivers fs_reclaim
> > > safe.
> > > 
> > > If that is enough to guarantee progress of OOM, then lets consider
> > > something like using current_gfp_context() to force PF_MEMALLOC_NOFS
> > > allocation behavior on the driver callback and lockdep to try and keep
> > > pushing on the the debugging, and dropping !blocking.
> > 
> > How are you going to enforce indirect dependency? E.g. a lock that is
> > also used in other context which depend on sleepable memory allocation
> > to move forward.
> 
> You mean like this:
> 
>        CPU0                                 CPU1
>                                         mutex_lock()
>                                         kmalloc(GFP_KERNEL)

no I mean __GFP_DIRECT_RECLAIM here.

>                                         mutex_unlock()
>   fs_reclaim_acquire()
>   mutex_lock() <- illegal: lock dep assertion

I cannot really comment on how that is achievable by lockdep. I managed
to forget the details of FS/IO reclaim recursion tracking already and I
do not have time to learn it again. It was quite a hack. Anyway, let me
repeat that the primary motivation was a simple aid. Not something as
powerful as lockdep.
Jerome Glisse Aug. 15, 2019, 5:57 p.m. UTC | #17
On Thu, Aug 15, 2019 at 07:42:07PM +0200, Michal Hocko wrote:
> On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> > 
> > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > > __GFP_DIRECT_RECLAIM..
> > > >
> > > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > > allocations during OOM, then I think fs_reclaim already matches what
> > > > you described?
> > > 
> > > No GFP_NOFS is equally bad. Please read my other email explaining what
> > > the oom_reaper actually requires. In short no blocking on direct or
> > > indirect dependecy on memory allocation that might sleep.
> > 
> > It is much easier to follow with some hints on code, so the true
> > requirement is that the OOM reaper not block on GFP_FS and GFP_IO
> > allocations, great, that constraint is now clear.
> 
> I still do not get why do you put FS/IO into the picture. This is really
> about __GFP_DIRECT_RECLAIM.
> 
> > 
> > > If you can express that in the existing lockdep machinery. All
> > > fine. But then consider deployments where lockdep is no-no because
> > > of the overhead.
> > 
> > This is all for driver debugging. The point of lockdep is to find all
> > these paths without have to hit them as actual races, using debug
> > kernels.
> > 
> > I don't think we need this kind of debugging on production kernels?
> 
> Again, the primary motivation was a simple debugging aid that could be
> used without worrying about overhead. So lockdep is very often out of
> the question.
> 
> > > > The best we got was drivers tested the VA range and returned success
> > > > if they had no interest. Which is a big win to be sure, but it looks
> > > > like getting any more is not really posssible.
> > > 
> > > And that is already a great win! Because many notifiers only do care
> > > about particular mappings. Please note that backing off unconditioanlly
> > > will simply cause that the oom reaper will have to back off not doing
> > > any tear down anything.
> > 
> > Well, I'm working to propose that we do the VA range test under core
> > mmu notifier code that cannot block and then we simply remove the idea
> > of blockable from drivers using this new 'range notifier'. 
> > 
> > I think this pretty much solves the concern?
> 
> Well, my idea was that a range check and early bail out was a first step
> and then each specific notifier would be able to do a more specific
> check. I was not able to do the second step because that requires a deep
> understanding of the respective subsystem.
> 
> Really all I do care about is to reclaim as much memory from the
> oom_reaper context as possible. And that cannot really be an unbounded
> process. Quite contrary it should be as swift as possible. From my
> cursory look some notifiers are able to achieve their task without
> blocking or depending on memory just fine. So bailing out
> unconditionally on the range of interest would just put us back.

Agree. OOM is just asking the question: can I unmap that page quickly,
so that it (OOM) can swap it out? In many cases the driver needs to
look something up to see whether, at that time, the memory is just not
in use and can be reclaimed right away. So the driver should have a
path to optimistically update its state to allow quick reclaim.


> > > > However, we could (probably even should) make the drivers fs_reclaim
> > > > safe.
> > > > 
> > > > If that is enough to guarantee progress of OOM, then lets consider
> > > > something like using current_gfp_context() to force PF_MEMALLOC_NOFS
> > > > allocation behavior on the driver callback and lockdep to try and keep
> > > > pushing on the the debugging, and dropping !blocking.
> > > 
> > > How are you going to enforce indirect dependency? E.g. a lock that is
> > > also used in other context which depend on sleepable memory allocation
> > > to move forward.
> > 
> > You mean like this:
> > 
> >        CPU0                                 CPU1
> >                                         mutex_lock()
> >                                         kmalloc(GFP_KERNEL)
> 
> no I mean __GFP_DIRECT_RECLAIM here.
> 
> >                                         mutex_unlock()
> >   fs_reclaim_acquire()
> >   mutex_lock() <- illegal: lock dep assertion
> 
> I cannot really comment on how that is achieveable by lockdep. I managed
> to forget details about FS/IO reclaim recursion tracking already and I
> do not have time to learn it again. It was quite a hack. Anyway, let me
> repeat that the primary motivation was a simple aid. Not something as
> poverful as lockdep.

I feel that fs_reclaim_acquire() is just too heavyweight here. I
do think that Daniel's patches improve the debugging situation without
burdening anything, so I am in favor of merging them.

I do not think we should devote too much time to fs_reclaim(); our
time would be better spent improving the driver shrinkers, for
instance. After all, OOM is all about trying to free up memory.

Cheers,
Jérôme
Jerome Glisse Aug. 15, 2019, 6:27 p.m. UTC | #18
On Thu, Aug 15, 2019 at 03:01:59PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 01:39:22PM -0400, Jerome Glisse wrote:
> > On Thu, Aug 15, 2019 at 02:35:57PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote:
> > > 
> > > > I'm not really well versed in the details of our userptr, but both
> > > > amdgpu and i915 wait for the gpu to complete from
> > > > invalidate_range_start. Jerome has at least looked a lot at the amdgpu
> > > > one, so maybe he can explain what exactly it is we're doing ...
> > > 
> > > amdgpu is (wrongly) using hmm for something, I can't really tell what
> > > it is trying to do. The calls to dma_fence under the
> > > invalidate_range_start do not give me a good feeling.
> > > 
> > > However, i915 shows all the signs of trying to follow the registration
> > > cache model, it even has a nice comment in
> > > i915_gem_userptr_get_pages() explaining that the races it has don't
> > > matter because it is a user space bug to change the VA mapping in the
> > > first place. That just screams registration cache to me.
> > > 
> > > So it is fine to run HW that way, but if you do, there is no reason to
> > > fence inside the invalidate_range end. Just orphan the DMA buffer and
> > > clean it up & release the page pins when all DMA buffer refs go to
> > > zero. The next access to that VA should get a new DMA buffer with the
> > > right mapping.
> > > 
> > > In other words the invalidation should be very simple without
> > > complicated locking, or wait_event's. Look at hfi1 for example.
> > 
> > This would break the today usage model of uptr and it will
> > break userspace expectation ie if GPU is writing to that
> > memory and that memory then the userspace want to make sure
> > that it will see what the GPU write.
> 
> How exactly? This is holding the page pin, so the only way the VA
> mapping can be changed is via explicit user action.
> 
> ie:
> 
>    gpu_write_something(va, size)
>    mmap(.., va, size, MMAP_FIXED);
>    gpu_wait_done()
> 
> This is racy and indeterminate with both models.
> 
> Based on the comment in i915 it appears to be going on the model that
> changes to the mmap by userspace when the GPU is working on it is a
> programming bug. This is reasonable, lots of systems use this kind of
> consistency model.

Well, a userspace process doing munmap(), mremap(), fork() and things like
that is a bug from the point of view of the i915 kernel/userspace contract.

But things like migration or reclaim are not covered by that contract,
and for those the expectation is that a CPU access to the same virtual
address should return whatever was last written to it, either by the GPU
or the CPU.

> 
> Since the driver seems to rely on a dma_fence to block DMA access, it
> looks to me like the kernel has full visibility to the
> 'gpu_write_something()' and 'gpu_wait_done()' markers.

So let's only consider the case where the GPU wants to write to the memory
(the read-only case is obviously fine and in fact needs no notifier)
and, as above, the only thing we care about is reclaim or migration
(for instance because of some NUMA compaction), as the rest is covered by
the i915 userspace contract.

So in the write case we do GUP-fast(write=true), which is "safe" against
any concurrent CPU page table update, i.e. if GUP-fast gets a reference on
the page then any racing CPU page table update will not be able to migrate
or reclaim the page, and thus the virtual-address-to-page association is
preserved.

This is only true because of GUP-fast(); if GUP-fast() fails it falls
back to the slow GUP path, which makes the same thing safe by taking
the page table lock.

Because of the reference on the page, the i915 driver can forego the mmu
notifier end callback. The thing here is that taking a page reference
is pointless if we have better synchronization and tracking of mmu
notifiers. Hence converting to hmm mirror allows avoiding the page
reference while still keeping the same functionality as today.
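
Concretely, the pinning step described above boils down to something
like this (illustrative only, not the actual i915 code; the error label
is made up):

    /* write=true pin: as long as we hold these page references a racing
     * reclaim/migration cannot break the VA->page association */
    npinned = get_user_pages_fast(userptr, npages, FOLL_WRITE, pages);
    if (npinned < npages)
            goto unpin_and_bail;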


> I think trying to use hmm_range_fault on HW that can't do HW page
> faulting and HW 'TLB shootdown' is a very, very bad idea. I fear that
> is what amd gpu is trying to do.
> 
> I haven't yet seen anything that looks like 'TLB shootdown' in i915??

GPU drivers have complex usage patterns, and the TLB shootdown is implicit:
once the GEM object associated with the userptr is invalidated, the next
time userspace submits a command against that GEM object it will have
to be re-validated, which means re-programming the GPU page table to
point to the proper addresses (and re-calling GUP).

So the whole GPU page table update is hidden behind GEM functions
like i915_gem_object_unbind() (or the ttm invalidate for amd/radeon).

Technically those GPUs do not support page faulting, but because of the
way GPUs work you know that you have frequent checkpoints where you
can unbind things. This is what is happening here.

Also, not all GPU engines can handle page faults; this is true of all
GPUs with page-fault support AFAIK (AMD and NVidia so far). This is why
userptr is a limited form of SVM (shared virtual memory) that can be
used on all GPU engines, including the DMA engine. When using the
full SVM power (like hmm mirror with nouveau), it is only used on the
GPU engines that support page faults (but updating the page table
still requires locking and waiting for an acknowledge).

Cheers,
Jérôme
Michal Hocko Aug. 15, 2019, 7:05 p.m. UTC | #19
On Thu 15-08-19 15:24:48, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 07:42:07PM +0200, Michal Hocko wrote:
> > On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote:
> > > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> > > 
> > > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > > > __GFP_DIRECT_RECLAIM..
> > > > >
> > > > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > > > allocations during OOM, then I think fs_reclaim already matches what
> > > > > you described?
> > > > 
> > > > No GFP_NOFS is equally bad. Please read my other email explaining what
> > > > the oom_reaper actually requires. In short no blocking on direct or
> > > > indirect dependency on memory allocation that might sleep.
> > > 
> > > It is much easier to follow with some hints on code, so the true
> > > requirement is that the OOM reaper not block on GFP_FS and GFP_IO
> > > allocations, great, that constraint is now clear.
> > 
> > I still do not get why do you put FS/IO into the picture. This is really
> > about __GFP_DIRECT_RECLAIM.
> 
> Like I said this is complicated, translating "no blocking on direct or
> indirect dependency on memory allocation that might sleep" into GFP
> flags is hard for us outside the mm community.
> 
> So the constraint here is no __GFP_DIRECT_RECLAIM?

OK, I am obviously failing to explain that. Sorry about that. You are
right that this is not simple. Let me try again.

The context we are calling !blockable notifiers from has to finish in a
_finite_ amount of time (and swiftness is hugely appreciated by users of
an otherwise non-responsive system that is under OOM). We are out of memory,
so we cannot block waiting for memory, directly or indirectly (via a lock,
or more generally by waiting for an event that needs memory to finish). So
you need to track dependencies over more complicated contexts than the
direct call path (think of workqueues, for example).
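
To illustrate the kind of indirect dependency I mean (schematic only, not
taken from any specific driver):

      oom_reaper                          some other task
      ----------                          ---------------
                                          mutex_lock(&dev->lock)
                                          kmalloc(GFP_KERNEL)  /* blocks, we are OOM */
      invalidate callback
        mutex_lock(&dev->lock)            /* blocks behind the allocation, which in
                                             turn waits for the reaper to free memory
                                             -> no forward progress */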

> I bring up FS/IO because that is what Tejun mentioned when I asked him
> about reclaim restrictions, and is what fs_reclaim_acquire() is
> already sensitive too. It is pretty confusing if we have places using
> the word 'reclaim' with different restrictions. :(

fs_reclaim was invented to catch potential deadlocks when a
GFP_NO{FS/IO} allocation recurses into fs/io reclaim. This is a subset of
the reclaim that we have. The oom context is more restricted: it cannot
depend on _any_ memory reclaim, because by the time we have got to this
context all the reclaim has already failed and waiting some more will
simply not help.

> > >        CPU0                                 CPU1
> > >                                         mutex_lock()
> > >                                         kmalloc(GFP_KERNEL)
> > 
> > no I mean __GFP_DIRECT_RECLAIM here.
> > 
> > >                                         mutex_unlock()
> > >   fs_reclaim_acquire()
> > >   mutex_lock() <- illegal: lock dep assertion
> > 
> > I cannot really comment on how that is achievable by lockdep.
> 
> ??? I am trying to explain this is already done and working today. The
> above example will already fault with lockdep enabled.
> 
> This is existing debugging we can use and improve upon rather that
> invent new debugging.

This is what you claim and I am saying that fs_reclaim is about a
restricted reclaim context and it is an ugly hack. It has proven to
report false positives. Maybe it can be extended to generic reclaim.
I haven't tried that, and I do not aim to try it.
Michal Hocko Aug. 15, 2019, 7:35 p.m. UTC | #20
On Thu 15-08-19 16:18:10, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 09:05:25PM +0200, Michal Hocko wrote:
> 
> > This is what you claim and I am saying that fs_reclaim is about a
> > restricted reclaim context and it is an ugly hack. It has proven to
> > report false positives. Maybe it can be extended to a generic reclaim.
> > I haven't tried that. Do not aim to try it.
> 
> Okay, great, I think this has been very helpful, at least for me,
> thanks. I did not know fs_reclaim was so problematic, or the special
> cases about OOM 'reclaim'. 

I am happy that this is more clear now.

> On this patch, I have no general objection to enforcing drivers to be
> non-blocking, I'd just like to see it done with the existing lockdep
> can't sleep detection rather than inventing some new debugging for it.
> 
> I understand this means the debugging requires lockdep enabled and
> will not run in production, but I'm of the view that is OK and in line
> with general kernel practice.

Yes and I do agree with this in general.

> The last detail is I'm still unclear what a GFP flags a blockable
> invalidate_range_start() should use. Is GFP_KERNEL OK?

I hope I will not make this muddy again ;)
invalidate_range_start in the blockable mode can use/depend on any sleepable
allocation allowed in the context it is called from. So in other words
it is no different from any other function in the kernel that calls into
allocator. As the API is missing gfp context then I hope it is not
called from any restricted contexts (except from the oom which we have
!blockable for).

> Lockdep has
> complained on that in past due to fs_reclaim - how do you know if it
> is a false positive?

I would have to see the specific lockdep splat.
Daniel Vetter Aug. 15, 2019, 8:16 p.m. UTC | #21
On Thu, Aug 15, 2019 at 9:35 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 15-08-19 16:18:10, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 09:05:25PM +0200, Michal Hocko wrote:
> >
> > > This is what you claim and I am saying that fs_reclaim is about a
> > > restricted reclaim context and it is an ugly hack. It has proven to
> > > report false positives. Maybe it can be extended to a generic reclaim.
> > > I haven't tried that. Do not aim to try it.
> >
> > Okay, great, I think this has been very helpful, at least for me,
> > thanks. I did not know fs_reclaim was so problematic, or the special
> > cases about OOM 'reclaim'.
>
> I am happy that this is more clear now.
>
> > On this patch, I have no general objection to enforcing drivers to be
> > non-blocking, I'd just like to see it done with the existing lockdep
> > can't sleep detection rather than inventing some new debugging for it.
> >
> > I understand this means the debugging requires lockdep enabled and
> > will not run in production, but I'm of the view that is OK and in line
> > with general kernel practice.
>
> Yes and I do agree with this in general.

So if someone can explain to me how that works with lockdep I can of
course implement it. But afaics that doesn't exist (I tried to explain
that somewhere else already), and I'm not really looking forward to
also hacking on lockdep for this little series.

> > The last detail is I'm still unclear what a GFP flags a blockable
> > invalidate_range_start() should use. Is GFP_KERNEL OK?
>
> I hope I will not make this muddy again ;)
> invalidate_range_start in the blockable mode can use/depend on any sleepable
> allocation allowed in the context it is called from. So in other words
> it is no different from any other function in the kernel that calls into
> allocator. As the API is missing gfp context then I hope it is not
> called from any restricted contexts (except from the oom which we have
> !blockable for).

Hm, that's new to me. I thought mmu notifiers very much can be called
from direct reclaim paths, so you have to be extremely careful with
getting back into that one. At least the lockdep splats I remember
also tend to have fs_reclaim in there, that's where all the fun comes
from.

> > Lockdep has
> > complained on that in past due to fs_reclaim - how do you know if it
> > is a false positive?
>
> I would have to see the specific lockdep splat.

I guess the lockdep annotation for invalidate_range_start carries the
same risks as the fs_reclaim annotation. Still feels worth it, though.
-Daniel
Daniel Vetter Aug. 15, 2019, 8:49 p.m. UTC | #22
On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote:
> > So if someone can explain to me how that works with lockdep I can of
> > course implement it. But afaics that doesn't exist (I tried to explain
> > that somewhere else already), and I'm no really looking forward to
> > hacking also on lockdep for this little series.
>
> Hmm, kind of looks like it is done by calling preempt_disable()

Yup. That was v1, then came the suggestion that disabling preemption
is maybe not the best thing (the oom reaper could still run for a long
time comparatively, if it's cleaning out gigabytes of process memory
or what not, hence this dedicated debug infrastructure).

> Probably the debug option is CONFIG_DEBUG_PREEMPT, not lockdep?

CONFIG_DEBUG_ATOMIC_SLEEP. Like in my patch.
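
To spell out how the annotation is meant to be used around the notifier
call, roughly (a sketch of the intent only, the actual hook-up is a
separate patch in this series):

    if (!mmu_notifier_range_blockable(range))
            non_block_start();
    ret = ops->invalidate_range_start(mn, range);
    if (!mmu_notifier_range_blockable(range))
            non_block_end();

non_block_start()/non_block_end() compile away entirely without
CONFIG_DEBUG_ATOMIC_SLEEP, so production builds pay nothing.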
-Daniel
Andrew Morton Aug. 15, 2019, 10:15 p.m. UTC | #23
On Thu, 15 Aug 2019 10:44:29 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> > I continue to struggle with this.  It introduces a new kernel state
> > "running preemptibly but must not call schedule()".  How does this make
> > any sense?
> > 
> > Perhaps a much, much more detailed description of the oom_reaper
> > situation would help out.
>  
> The primary point here is that there is a demand of non blockable mmu
> notifiers to be called when the oom reaper tears down the address space.
> As the oom reaper is the primary guarantee of the oom handling forward
> progress it cannot be blocked on anything that might depend on blockable
> memory allocations. These are not really easy to track because they
> might be indirect - e.g. notifier blocks on a lock which other context
> holds while allocating memory or waiting for a flusher that needs memory
> to perform its work. If such a blocking state happens that we can end up
> in a silent hang with an unusable machine.
> Now we hope for reasonable implementations of mmu notifiers (strong
> words I know ;) and this should be relatively simple and effective catch
> all tool to detect something suspicious is going on.
> 
> Does that make the situation more clear?

Yes, thanks, much.  Maybe a code comment along the lines of

  This is on behalf of the oom reaper, specifically when it is
  calling the mmu notifiers.  The problem is that if the notifier were
  to block on, for example, mutex_lock() and if the process which holds
  that mutex were to perform a sleeping memory allocation, the oom
  reaper is now blocked on completion of that memory allocation.

btw, do we need task_struct.non_block_count?  Perhaps the oom reaper
thread could set a new PF_NONBLOCK (which would be more general than
PF_OOM_REAPER).  If we run out of PF_ flags, use (current == oom_reaper_th).
Daniel Vetter Aug. 16, 2019, 6:20 a.m. UTC | #24
On Fri, Aug 16, 2019 at 3:00 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Thu, Aug 15, 2019 at 10:49:31PM +0200, Daniel Vetter wrote:
> > On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote:
> > > > So if someone can explain to me how that works with lockdep I can of
> > > > course implement it. But afaics that doesn't exist (I tried to explain
> > > > that somewhere else already), and I'm no really looking forward to
> > > > hacking also on lockdep for this little series.
> > >
> > > Hmm, kind of looks like it is done by calling preempt_disable()
> >
> > Yup. That was v1, then came the suggestion that disabling preemption
> > is maybe not the best thing (the oom reaper could still run for a long
> > time comparatively, if it's cleaning out gigabytes of process memory
> > or what not, hence this dedicated debug infrastructure).
>
> Oh, I'm coming in late, sorry
>
> Anyhow, I was thinking since we agreed this can trigger on some
> CONFIG_DEBUG flag, something like
>
>     /* This is a sleepable region, but use preempt_disable to get debugging
>      * for calls that are not allowed to block for OOM [.. insert
>      * Michal's explanation.. ] */
>     if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP) && !mmu_notifier_range_blockable(range))
>         preempt_disable();
>     ops->invalidate_range_start();

I think we also discussed that, and some expressed concerns it would
change behaviour/timing too much for testing, since this does disable
preemption for real, not just for might_sleep.

> And I have also been idly mulling doing something like
>
>    if (IS_ENABLED(CONFIG_DEBUG_NOTIFIERS) &&
>        rand &&
>        mmu_notifier_range_blockable(range)) {
>      range->flags = 0
>      if (!ops->invalidate_range_start(range))
>         continue
>
>      // Failed, try again as blockable
>      range->flags = MMU_NOTIFIER_RANGE_BLOCKABLE
>    }
>    ops->invalidate_range_start(range);
>
> Which would give coverage for this corner case without forcing OOM.

Hm, this sounds like a neat idea to slap on top. The rand is going to
be a bit tricky though, but I guess for this we could stuff another
counter into task_struct and just e.g. do this every 1000th or so
invalidate (we'll need to pick a prime so we cycle through the notifiers
in case there are multiple). I like.
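
Rough sketch of what I mean, inside the loop over the registered
notifiers (nonblockable_test as a per-task counter is entirely made up
here, 1009 is just some prime, and error handling is elided):

    if (IS_ENABLED(CONFIG_DEBUG_NOTIFIERS) &&
        mmu_notifier_range_blockable(range) &&
        (current->nonblockable_test++ % 1009) == 0) {
            /* pretend we are in the OOM path and see if the driver copes */
            range->flags &= ~MMU_NOTIFIER_RANGE_BLOCKABLE;
            ret = ops->invalidate_range_start(mn, range);
            range->flags |= MMU_NOTIFIER_RANGE_BLOCKABLE;
            if (!ret)
                    continue;       /* driver handled it without blocking */
            /* driver bailed, fall through and retry as blockable */
    }
    ops->invalidate_range_start(mn, range);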

Michal, thoughts?
-Daniel
Michal Hocko Aug. 16, 2019, 8:10 a.m. UTC | #25
On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote:
> 
> > > The last detail is I'm still unclear what a GFP flags a blockable
> > > invalidate_range_start() should use. Is GFP_KERNEL OK?
> > 
> > I hope I will not make this muddy again ;)
> > invalidate_range_start in the blockable mode can use/depend on any sleepable
> > allocation allowed in the context it is called from. 
> 
> 'in the context it is called from' is the magic phrase, as
> invalidate_range_start is called while holding several different mm
> related locks. I know at least write mmap_sem and i_mmap_rwsem
> (write?)
> 
> Can GFP_KERNEL be called while holding those locks?

i_mmap_rwsem would be problematic because it is taken during the
reclaim.

> This is the question of indirect dependency on reclaim via locks you
> raised earlier.
> 
> > So in other words it is no different from any other function in the
> > kernel that calls into allocator. As the API is missing gfp context
> > then I hope it is not called from any restricted contexts (except
> > from the oom which we have !blockable for).
> 
> Yes, the callers are exactly my concern.
>  
> > > Lockdep has
> > > complained on that in past due to fs_reclaim - how do you know if it
> > > is a false positive?
> > 
> > I would have to see the specific lockdep splat.
> 
> See below. I found it when trying to understand why the registration
> of the mmu notififer was so oddly coded.
> 
> The situation was:
> 
>   down_write(&mm->mmap_sem);
>   mm_take_all_locks(mm);
>   kmalloc(GFP_KERNEL);  <--- lockdep warning

Ugh. mm_take_all_locks :/

> I understood Daniel said he saw this directly on a recent kernel when
> working with his lockdep patch?
> 
> Checking myself, on todays kernel I see a call chain:
> 
> shrink_all_memory
>   fs_reclaim_acquire(sc.gfp_mask);
>   [..]
>   do_try_to_free_pages
>    shrink_zones
>     shrink_node
>      shrink_node_memcg
>       shrink_list
>        shrink_active_list
>         page_referenced
>          rmap_walk
>           rmap_walk_file
>            i_mmap_lock_read
>             down_read(i_mmap_rwsem)
> 
> So it is possible that the down_read() above will block on
> i_mmap_rwsem being held in the caller of invalidate_range_start which
> is doing kmalloc(GPF_KERNEL).
> 
> Is this OK? The lockdep annotation says no..

It's not, as per the above code path, which is easily possible because
mm_take_all_locks will lock all file vmas.
Michal Hocko Aug. 16, 2019, 8:24 a.m. UTC | #26
On Thu 15-08-19 15:15:09, Andrew Morton wrote:
> On Thu, 15 Aug 2019 10:44:29 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > > I continue to struggle with this.  It introduces a new kernel state
> > > "running preemptibly but must not call schedule()".  How does this make
> > > any sense?
> > > 
> > > Perhaps a much, much more detailed description of the oom_reaper
> > > situation would help out.
> >  
> > The primary point here is that there is a demand of non blockable mmu
> > notifiers to be called when the oom reaper tears down the address space.
> > As the oom reaper is the primary guarantee of the oom handling forward
> > progress it cannot be blocked on anything that might depend on blockable
> > memory allocations. These are not really easy to track because they
> > might be indirect - e.g. notifier blocks on a lock which other context
> > holds while allocating memory or waiting for a flusher that needs memory
> > to perform its work. If such a blocking state happens that we can end up
> > in a silent hang with an unusable machine.
> > Now we hope for reasonable implementations of mmu notifiers (strong
> > words I know ;) and this should be relatively simple and effective catch
> > all tool to detect something suspicious is going on.
> > 
> > Does that make the situation more clear?
> 
> Yes, thanks, much.  Maybe a code comment along the lines of
> 
>   This is on behalf of the oom reaper, specifically when it is
>   calling the mmu notifiers.  The problem is that if the notifier were
>   to block on, for example, mutex_lock() and if the process which holds
>   that mutex were to perform a sleeping memory allocation, the oom
>   reaper is now blocked on completion of that memory allocation.

    reaper is now blocked on completion of that memory allocation
    and cannot provide the guarantee of the OOM forward progress.

OK. 
 
> btw, do we need task_struct.non_block_count?  Perhaps the oom reaper
> thread could set a new PF_NONBLOCK (which would be more general than
> PF_OOM_REAPER).  If we run out of PF_ flags, use (current == oom_reaper_th).

Well, I do not have a strong opinion here. A simple check for the value
seems trivial. There are quite a few holes in task_struct where this
counter could hide without increasing the size.
Michal Hocko Aug. 16, 2019, 8:27 a.m. UTC | #27
On Thu 15-08-19 22:16:43, Daniel Vetter wrote:
> On Thu, Aug 15, 2019 at 9:35 PM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > > The last detail is I'm still unclear what a GFP flags a blockable
> > > invalidate_range_start() should use. Is GFP_KERNEL OK?
> >
> > I hope I will not make this muddy again ;)
> > invalidate_range_start in the blockable mode can use/depend on any sleepable
> > allocation allowed in the context it is called from. So in other words
> > it is no different from any other function in the kernel that calls into
> > allocator. As the API is missing gfp context then I hope it is not
> > called from any restricted contexts (except from the oom which we have
> > !blockable for).
> 
> Hm, that's new to me. I thought mmu notifiers very much can be called
> from direct reclaim paths, so you have to be extremely careful with
> getting back into that one.

Correct, I should have added that notifier callbacks ideally do not
allocate any memory. They can block and even that is quite a pain to be
honest.
Michal Hocko Aug. 16, 2019, 12:26 p.m. UTC | #28
On Fri 16-08-19 09:19:06, Jason Gunthorpe wrote:
> On Fri, Aug 16, 2019 at 10:10:29AM +0200, Michal Hocko wrote:
> > On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote:
> > > On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote:
> > > 
> > > > > The last detail is I'm still unclear what a GFP flags a blockable
> > > > > invalidate_range_start() should use. Is GFP_KERNEL OK?
> > > > 
> > > > I hope I will not make this muddy again ;)
> > > > invalidate_range_start in the blockable mode can use/depend on any sleepable
> > > > allocation allowed in the context it is called from. 
> > > 
> > > 'in the context it is called from' is the magic phrase, as
> > > invalidate_range_start is called while holding several different mm
> > > related locks. I know at least write mmap_sem and i_mmap_rwsem
> > > (write?)
> > > 
> > > Can GFP_KERNEL be called while holding those locks?
> > 
> > i_mmap_rwsem would be problematic because it is taken during the
> > reclaim.
> 
> Okay.. So the fs_reclaim debugging does catch errors.

I do not think fs_reclaim is the underlying mechanism to catch this
deadlock. It is a simple AA deadlock. You take i_mmap_rwsem and then
go down the allocation path into direct reclaim and take the lock again.
Nothing really surprising. fs_reclaim is really there to catch a GFP_NOFS
context calling into a less restricted (e.g. GFP_KERNEL) allocation
context.
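
Spelled out, condensing the chain you quoted (illustrative only):

      down_write(&mm->mmap_sem)
      mm_take_all_locks(mm)          /* takes i_mmap_rwsem of all file vmas */
      kmalloc(GFP_KERNEL)
        -> direct reclaim
             -> rmap_walk_file()
                  -> i_mmap_lock_read()   /* AA: the rwsem is already held above */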

> Do you have any
> reference for what a false positive looks like? 

I believe I have given some examples when introducing __GFP_NOLOCKDEP.
 
> I would like to inject it into the notifier path as this is very
> difficult for driver authors to discover and know about, but I'm
> worried about your false positive remark.
> 
> I think I understand we can use only GFP_ATOMIC in the notifiers, but
> we need a strategy to handle OOM to guarentee forward progress.

Your example is from the notifier registration IIUC. Can you
pre-allocate before taking locks? Could you point me to some examples
when the allocation is necessary in the range notifier callback?
Daniel Vetter Aug. 16, 2019, 2:11 p.m. UTC | #29
On Fri, Aug 16, 2019 at 2:12 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Aug 16, 2019 at 08:20:55AM +0200, Daniel Vetter wrote:
> > On Fri, Aug 16, 2019 at 3:00 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > On Thu, Aug 15, 2019 at 10:49:31PM +0200, Daniel Vetter wrote:
> > > > On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > > On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote:
> > > > > > So if someone can explain to me how that works with lockdep I can of
> > > > > > course implement it. But afaics that doesn't exist (I tried to explain
> > > > > > that somewhere else already), and I'm no really looking forward to
> > > > > > hacking also on lockdep for this little series.
> > > > >
> > > > > Hmm, kind of looks like it is done by calling preempt_disable()
> > > >
> > > > Yup. That was v1, then came the suggestion that disabling preemption
> > > > is maybe not the best thing (the oom reaper could still run for a long
> > > > time comparatively, if it's cleaning out gigabytes of process memory
> > > > or what not, hence this dedicated debug infrastructure).
> > >
> > > Oh, I'm coming in late, sorry
> > >
> > > Anyhow, I was thinking since we agreed this can trigger on some
> > > CONFIG_DEBUG flag, something like
> > >
> > >     /* This is a sleepable region, but use preempt_disable to get debugging
> > >      * for calls that are not allowed to block for OOM [.. insert
> > >      * Michal's explanation.. ] */
> > >     if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP) && !mmu_notifier_range_blockable(range))
> > >         preempt_disable();
> > >     ops->invalidate_range_start();
> >
> > I think we also discussed that, and some expressed concerns it would
> > change behaviour/timing too much for testing. Since this does does
> > disable preemption for real, not just for might_sleep.
>
> I don't follow, this is a debug kernel, it will have widly different
> timing.
>
> Further the point of this debugging on atomic_sleep is to be as
> timing-independent as possible since functions with rare sleeps should
> be guarded by might_sleep() in their common paths.
>
> I guess I don't get the push to have some low overhead debugging for
> this? Is there something special you are looking for?

Don't ask me, I'm just trying to get _some_ debugging for this in. I
don't care one bit how much overhead it has because in our CI our
debug build has lockdep and everything and it crawls anyway. I started
out with the preempt_disable/enable thing like you suggested, and a
few rounds later we're here. We can go back to v1 on this one, but I'd
prefer to not do the lap too often.

Also, aside from this patch (which is prep for the next) and some
simple reordering conflicts they're all independent. So if there's no
way to paint this bikeshed here (technicolor perhaps?) then I'd like
to get at least the others considered.
-Daniel
Jerome Glisse Aug. 16, 2019, 3:05 p.m. UTC | #30
On Fri, Aug 16, 2019 at 11:31:45AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 16, 2019 at 02:26:25PM +0200, Michal Hocko wrote:
> > On Fri 16-08-19 09:19:06, Jason Gunthorpe wrote:
> > > On Fri, Aug 16, 2019 at 10:10:29AM +0200, Michal Hocko wrote:
> > > > On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote:
> > > > > On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote:

[...]

> > > I would like to inject it into the notifier path as this is very
> > > difficult for driver authors to discover and know about, but I'm
> > > worried about your false positive remark.
> > > 
> > > I think I understand we can use only GFP_ATOMIC in the notifiers, but
> > > we need a strategy to handle OOM to guarentee forward progress.
> > 
> > Your example is from the notifier registration IIUC. 
> 
> Yes, that is where this commit hit it.. Triggering this under an
> actual notifier to get a lockdep report is hard.
> 
> > Can you pre-allocate before taking locks? Could you point me to some
> > examples when the allocation is necessary in the range notifier
> > callback?
> 
> Hmm. I took a careful look, I only found mlx5 as obviously allocating
> memory:
> 
>  mlx5_ib_invalidate_range()
>   mlx5_ib_update_xlt()
>    __get_free_pages(gfp, get_order(size));
> 
> However, I think this could be changed to fall back to some small
> buffer if allocation fails. The existing scheme looks sketchy
> 
> nouveau does:
> 
>  nouveau_svmm_invalidate
>   nvif_object_mthd
>    kmalloc(GFP_KERNEL)
> 
> But I think it reliably uses a stack buffer here
> 
> i915 I think Daniel said he audited.
> 
> amd_mn.. The actual invalidate_range_start does not allocate memory,
> but it is entangled with so many locks it would need careful analysis
> to be sure.
> 
> The others look generally OK, which is good, better than I hoped :)

It is on my TODO list to get rid of allocations in the notifier callbacks
(iirc nouveau already uses the stack, unless that was lost in one of the
many revisions it went through). Anyway I do not think we need allocations
in the notifiers.
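
The pattern is basically this (a sketch only, with made-up helpers, not
the actual nouveau code):

    static int my_invalidate_range_start(struct mmu_notifier *mn,
                                         const struct mmu_notifier_range *range)
    {
            char cmd[64];   /* sized for the worst-case command */

            /* Build the invalidation command on the stack instead of
             * kmalloc()ing a buffer, so the callback never allocates. */
            my_build_cmd(cmd, sizeof(cmd), range->start, range->end);
            return my_submit_cmd(mn, cmd, sizeof(cmd));
    }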

Cheers,
Jérôme
Daniel Vetter Aug. 16, 2019, 4:36 p.m. UTC | #31
On Fri, Aug 16, 2019 at 4:38 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Aug 16, 2019 at 04:11:34PM +0200, Daniel Vetter wrote:
> > Also, aside from this patch (which is prep for the next) and some
> > simple reordering conflicts they're all independent. So if there's no
> > way to paint this bikeshed here (technicolor perhaps?) then I'd like
> > to get at least the others considered.
>
> Sure, I think for conflict avoidance reasons I'm probably taking
> mmu_notifier stuff via hmm.git, so:
>
> - Andrew had a minor remark on #1, I am ambivalent and would take it
>   as-is. Your decision if you want to respin.

I like mine better, see also the reply from Ralph Campbell.

> - #2/#3 is this issue, I would stand by the preempt_disable/etc path
>   Our situation matches yours, debug tests run lockdep/etc.

Since Michal requested the current flavour I think we need to spin a bit
more on these here. I guess I'll just rebase them to the end so
they're not holding up the others.

> - #4 I like a lot, except the map should enclose range_end too,
>   this can be done after the mm_has_notifiers inside the
>   __mmu_notifier function

To make sure I get this right: The same lockdep context, but also
wrapped around invalidate_range_end? From my understanding of pte
zapping that makes sense, but I'm definitely not well-versed enough
for that.

>   Can you respin?

Will do.

>   I will propose preloading the map in another patch
> - #5 is already applied in -rc

Yup, I'll drop that one.

Thanks, Daniel
Michal Hocko Aug. 20, 2019, 8:18 a.m. UTC | #32
On Fri 16-08-19 11:31:45, Jason Gunthorpe wrote:
> On Fri, Aug 16, 2019 at 02:26:25PM +0200, Michal Hocko wrote:
[...]
> > I believe I have given some examples when introducing __GFP_NOLOCKDEP.
> 
> Okay, I think that is 7e7844226f10 ("lockdep: allow to disable reclaim
> lockup detection") Hmm, sadly the lkml link in the commit is broken.
> 
> Hum. There are no users of __GFP_NOLOCKDEP today?? Could all the false
> positives have been fixed??

I would be more than surprised. Those false positives were usually
papered over by using GFP_NOFS. And the original plan was to convert
those back to GFP_KERNEL-like allocations and use __GFP_NOLOCKDEP.
 
> Keep in mind that this fs_reclaim was reworked away from the hacky
> interrupt context flag to a fairly elegant real lock map.

I am glad to hear that because that was just too ugly to live.

> Based on my read of the commit message, my first reaction would be to
> suggest lockdep_set_class() to solve the problem described, ie
> different classes for 'inside transaction' and 'outside transaction'
> on xfs_refcountbt_init_cursor()

No, this just turned out to be unmaintainable. The number of lock classes
was growing too high. I recommend one of Dave Chinner's rants on the
topic; sorry, no link handy.

> I understood that generally that is the way to handle lockdep false
> positives.
> 
> Anyhow, if you are willing to consider that lockdep isn't broken, I
> have some ideas on how to make this clearer and increase
> coverage. Would you be willing to look at patches on this topic? (not
> soon, I have to finish my mmu notifier stuff)

I haven't claimed that lockdep is broken. It just had problems capturing
some code paths and generated false positives. I would recommend
talking to the lockdep maintainers much more than to me, because I would
have to dive into the code much more to be useful. I can still comment on
the MM side of things, of course, if that is helpful.

> > > I would like to inject it into the notifier path as this is very
> > > difficult for driver authors to discover and know about, but I'm
> > > worried about your false positive remark.
> > > 
> > > I think I understand we can use only GFP_ATOMIC in the notifiers, but
> > > we need a strategy to handle OOM to guarentee forward progress.
> > 
> > Your example is from the notifier registration IIUC. 
> 
> Yes, that is where this commit hit it.. Triggering this under an
> actual notifier to get a lockdep report is hard.

All you need is to generate some memory pressure. That shouldn't be that
hard.
diff mbox series

Patch

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 4fa360a13c1e..915fd9888afb 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -217,7 +217,9 @@  extern void __cant_sleep(const char *file, int line, int preempt_offset);
  * might_sleep - annotation for functions that can sleep
  *
  * this macro will print a stack trace if it is executed in an atomic
- * context (spinlock, irq-handler, ...).
+ * context (spinlock, irq-handler, ...). Additional sections where blocking is
+ * not allowed can be annotated with non_block_start() and non_block_end()
+ * pairs.
  *
  * This is a useful debugging help to be able to catch problems early and not
  * be bitten later when the calling function happens to sleep when it is not
@@ -233,6 +235,10 @@  extern void __cant_sleep(const char *file, int line, int preempt_offset);
 # define cant_sleep() \
 	do { __cant_sleep(__FILE__, __LINE__, 0); } while (0)
 # define sched_annotate_sleep()	(current->task_state_change = 0)
+# define non_block_start() \
+	do { current->non_block_count++; } while (0)
+# define non_block_end() \
+	do { WARN_ON(current->non_block_count-- == 0); } while (0)
 #else
   static inline void ___might_sleep(const char *file, int line,
 				   int preempt_offset) { }
@@ -241,6 +247,8 @@  extern void __cant_sleep(const char *file, int line, int preempt_offset);
 # define might_sleep() do { might_resched(); } while (0)
 # define cant_sleep() do { } while (0)
 # define sched_annotate_sleep() do { } while (0)
+# define non_block_start() do { } while (0)
+# define non_block_end() do { } while (0)
 #endif
 
 #define might_sleep_if(cond) do { if (cond) might_sleep(); } while (0)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9f51932bd543..c5630f3dca1f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -974,6 +974,10 @@  struct task_struct {
 	struct mutex_waiter		*blocked_on;
 #endif
 
+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+	int				non_block_count;
+#endif
+
 #ifdef CONFIG_TRACE_IRQFLAGS
 	unsigned int			irq_events;
 	unsigned long			hardirq_enable_ip;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b037f195473..57245770d6cc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3700,13 +3700,22 @@  static noinline void __schedule_bug(struct task_struct *prev)
 /*
  * Various schedule()-time debugging checks and statistics:
  */
-static inline void schedule_debug(struct task_struct *prev)
+static inline void schedule_debug(struct task_struct *prev, bool preempt)
 {
 #ifdef CONFIG_SCHED_STACK_END_CHECK
 	if (task_stack_end_corrupted(prev))
 		panic("corrupted stack end detected inside scheduler\n");
 #endif
 
+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+	if (!preempt && prev->state && prev->non_block_count) {
+		printk(KERN_ERR "BUG: scheduling in a non-blocking section: %s/%d/%i\n",
+			prev->comm, prev->pid, prev->non_block_count);
+		dump_stack();
+		add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
+	}
+#endif
+
 	if (unlikely(in_atomic_preempt_off())) {
 		__schedule_bug(prev);
 		preempt_count_set(PREEMPT_DISABLED);
@@ -3813,7 +3822,7 @@  static void __sched notrace __schedule(bool preempt)
 	rq = cpu_rq(cpu);
 	prev = rq->curr;
 
-	schedule_debug(prev);
+	schedule_debug(prev, preempt);
 
 	if (sched_feat(HRTICK))
 		hrtick_clear(rq);
@@ -6570,7 +6579,7 @@  void ___might_sleep(const char *file, int line, int preempt_offset)
 	rcu_sleep_check();
 
 	if ((preempt_count_equals(preempt_offset) && !irqs_disabled() &&
-	     !is_idle_task(current)) ||
+	     !is_idle_task(current) && !current->non_block_count) ||
 	    system_state == SYSTEM_BOOTING || system_state > SYSTEM_RUNNING ||
 	    oops_in_progress)
 		return;
@@ -6586,8 +6595,8 @@  void ___might_sleep(const char *file, int line, int preempt_offset)
 		"BUG: sleeping function called from invalid context at %s:%d\n",
 			file, line);
 	printk(KERN_ERR
-		"in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n",
-			in_atomic(), irqs_disabled(),
+		"in_atomic(): %d, irqs_disabled(): %d, non_block: %d, pid: %d, name: %s\n",
+			in_atomic(), irqs_disabled(), current->non_block_count,
 			current->pid, current->comm);
 
 	if (task_stack_end_corrupted(current))