mm, oom: distinguish blockable mode for mmu notifiers

From: Michal Hocko <mhocko@suse.com>

From: Michal Hocko <mhocko@suse.com>

There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start and that is a problem for the
oom_reaper because it needs to guarantee a forward progress so it cannot
depend on any sleepable locks.

Currently we simply back off and mark an oom victim with blockable mmu
notifiers as done after a short sleep. That can result in selecting a
new oom victim prematurely because the previous one still hasn't torn
its memory down yet.

We can do much better though. Even if mmu notifiers use sleepable locks
there is no reason to automatically assume those locks are held.
Moreover majority of notifiers only care about a portion of the address
space and there is absolutely zero reason to fail when we are unmapping an
unrelated range. Many notifiers do really block and wait for HW which is
harder to handle and we have to bail out though.

This patch handles the low hanging fruid. __mmu_notifier_invalidate_range_start
gets a blockable flag and callbacks are not allowed to sleep if the
flag is set to false. This is achieved by using trylock instead of the
sleepable lock for most callbacks and continue as long as we do not
block down the call chain.

I think we can improve that even further because there is a common
pattern to do a range lookup first and then do something about that.
The first part can be done without a sleeping lock in most cases AFAICS.

The oom_reaper end then simply retries if there is at least one notifier
which couldn't make any progress in !blockable mode. A retry loop is
already implemented to wait for the mmap_sem and this is basically the
same thing.

Changes since rfc v1
- gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch
  on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure
  we bail out when we have an intersecting range for starter
- note that a notifier failed to the log for easier debugging
- back off early in ib_umem_notifier_invalidate_range_start if the
  callback is called
- mn_invl_range_start waits for completion down the unmap_grant_pages
  path so we have to back off early on overlapping ranges

Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: linux-rdma@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Cc: linux-mm@kvack.org
Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
there were no major objections when I sent this as an RFC the last time
[1]. I was hoping for more feedback in the drivers land because I am
touching the code I have no way to test. On the other hand the pattern
is quite simple and consistent over all users so there shouldn't be
any large surprises hopefully.

Any further review would be highly appreciate of course. But is this
something to put into the mm tree now?

[1] http://lkml.kernel.org/r/20180627074421.GF32348@dhcp22.suse.cz

 arch/x86/kvm/x86.c                      |  7 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 43 +++++++++++++++++++-----
 drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++++++--
 drivers/gpu/drm/radeon/radeon_mn.c      | 22 +++++++++++--
 drivers/infiniband/core/umem_odp.c      | 33 +++++++++++++++----
 drivers/infiniband/hw/hfi1/mmu_rb.c     | 11 ++++---
 drivers/infiniband/hw/mlx5/odp.c        |  2 +-
 drivers/misc/mic/scif/scif_dma.c        |  7 ++--
 drivers/misc/sgi-gru/grutlbpurge.c      |  7 ++--
 drivers/xen/gntdev.c                    | 44 ++++++++++++++++++++-----
 include/linux/kvm_host.h                |  4 +--
 include/linux/mmu_notifier.h            | 35 +++++++++++++++-----
 include/linux/oom.h                     |  2 +-
 include/rdma/ib_umem_odp.h              |  3 +-
 mm/hmm.c                                |  7 ++--
 mm/mmap.c                               |  2 +-
 mm/mmu_notifier.c                       | 19 ++++++++---
 mm/oom_kill.c                           | 29 ++++++++--------
 virt/kvm/kvm_main.c                     | 15 ++++++---
 19 files changed, 225 insertions(+), 80 deletions(-)

mm, oom: distinguish blockable mode for mmu notifiers

Commit Message

Comments

Patch