[1/2] mm/mmu_notifier: Mark up direct reclaim paths with MAYFAIL
diff mbox series

Message ID 20200624080248.3701-1-chris@chris-wilson.co.uk
State New
Headers show
Series
  • [1/2] mm/mmu_notifier: Mark up direct reclaim paths with MAYFAIL
Related show

Commit Message

Chris Wilson June 24, 2020, 8:02 a.m. UTC
When direct reclaim enters the shrinker and tries to reclaim pages, it
has to opportunitically unmap them [try_to_unmap_one]. For direct
reclaim, the calling context is unknown and may include attempts to
unmap one page of a dma object while attempting to allocate more pages
for that object. Pass the information along that we are inside an
opportunistic unmap that can allow that page to remain referenced and
mapped, and let the callback opt in to avoiding a recursive wait.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
CC: Jason Gunthorpe <jgg@ziepe.ca>
---
 include/linux/mmu_notifier.h | 15 ++++++++++++++-
 mm/mmu_notifier.c            |  3 +++
 mm/rmap.c                    |  5 +++--
 3 files changed, 20 insertions(+), 3 deletions(-)

Comments

Chris Wilson June 24, 2020, 12:21 p.m. UTC | #1
Quoting Jason Gunthorpe (2020-06-24 13:10:53)
> On Wed, Jun 24, 2020 at 09:02:47AM +0100, Chris Wilson wrote:
> > When direct reclaim enters the shrinker and tries to reclaim pages, it
> > has to opportunitically unmap them [try_to_unmap_one]. For direct
> > reclaim, the calling context is unknown and may include attempts to
> > unmap one page of a dma object while attempting to allocate more pages
> > for that object. Pass the information along that we are inside an
> > opportunistic unmap that can allow that page to remain referenced and
> > mapped, and let the callback opt in to avoiding a recursive wait.
> 
> i915 should already not be holding locks shared with the notifiers
> across allocations that can trigger reclaim. This is already required
> to use notifiers correctly anyhow - why do we need something in the
> notifiers?

for (n = 0; n < num_pages; n++)
	pin_user_page()

may call try_to_unmap_page from the lru shrinker for [0, n-1].

We're in the middle of allocating the object, how are we best to untangle
that?

Or during allocation of something that is using the pages pinned by that
object, how are we best to not to shrink that object as there is a
mutual dependency?
-Chris
Chris Wilson June 24, 2020, 2:12 p.m. UTC | #2
Quoting Jason Gunthorpe (2020-06-24 13:39:10)
> On Wed, Jun 24, 2020 at 01:21:03PM +0100, Chris Wilson wrote:
> > Quoting Jason Gunthorpe (2020-06-24 13:10:53)
> > > On Wed, Jun 24, 2020 at 09:02:47AM +0100, Chris Wilson wrote:
> > > > When direct reclaim enters the shrinker and tries to reclaim pages, it
> > > > has to opportunitically unmap them [try_to_unmap_one]. For direct
> > > > reclaim, the calling context is unknown and may include attempts to
> > > > unmap one page of a dma object while attempting to allocate more pages
> > > > for that object. Pass the information along that we are inside an
> > > > opportunistic unmap that can allow that page to remain referenced and
> > > > mapped, and let the callback opt in to avoiding a recursive wait.
> > > 
> > > i915 should already not be holding locks shared with the notifiers
> > > across allocations that can trigger reclaim. This is already required
> > > to use notifiers correctly anyhow - why do we need something in the
> > > notifiers?
> > 
> > for (n = 0; n < num_pages; n++)
> >       pin_user_page()
> > 
> > may call try_to_unmap_page from the lru shrinker for [0, n-1].
> 
> Yes, of course you can't hold any locks that intersect with notifiers
> across pin_user_page()/get_user_page()

What lock though? It's just the page refcount, shrinker asks us to drop
it [via mmu], we reply we would like to keep using that page as freeing
it for the current allocation is "robbing Peter to pay Paul".
-Chris
Chris Wilson June 24, 2020, 2:21 p.m. UTC | #3
Quoting Jason Gunthorpe (2020-06-24 15:16:04)
> On Wed, Jun 24, 2020 at 03:12:42PM +0100, Chris Wilson wrote:
> > Quoting Jason Gunthorpe (2020-06-24 13:39:10)
> > > On Wed, Jun 24, 2020 at 01:21:03PM +0100, Chris Wilson wrote:
> > > > Quoting Jason Gunthorpe (2020-06-24 13:10:53)
> > > > > On Wed, Jun 24, 2020 at 09:02:47AM +0100, Chris Wilson wrote:
> > > > > > When direct reclaim enters the shrinker and tries to reclaim pages, it
> > > > > > has to opportunitically unmap them [try_to_unmap_one]. For direct
> > > > > > reclaim, the calling context is unknown and may include attempts to
> > > > > > unmap one page of a dma object while attempting to allocate more pages
> > > > > > for that object. Pass the information along that we are inside an
> > > > > > opportunistic unmap that can allow that page to remain referenced and
> > > > > > mapped, and let the callback opt in to avoiding a recursive wait.
> > > > > 
> > > > > i915 should already not be holding locks shared with the notifiers
> > > > > across allocations that can trigger reclaim. This is already required
> > > > > to use notifiers correctly anyhow - why do we need something in the
> > > > > notifiers?
> > > > 
> > > > for (n = 0; n < num_pages; n++)
> > > >       pin_user_page()
> > > > 
> > > > may call try_to_unmap_page from the lru shrinker for [0, n-1].
> > > 
> > > Yes, of course you can't hold any locks that intersect with notifiers
> > > across pin_user_page()/get_user_page()
> > 
> > What lock though? It's just the page refcount, shrinker asks us to drop
> > it [via mmu], we reply we would like to keep using that page as freeing
> > it for the current allocation is "robbing Peter to pay Paul".
> 
> Maybe I'm unclear what this series is actually trying to fix? 
> 
> You said "avoiding a recursive wait" which sounds like some locking
> deadlock to me.

It's the shrinker being called while we are allocating for/on behalf of
the object. As we are actively using the object, we don't want to free
it -- the partial object allocation being the clearest, if the object
consists of 2 pages, trying to free page 0 in order to allocate page 1
has to fail (and the shrinker should find another candidate to reclaim,
or fail the allocation).
-Chris
Chris Wilson June 24, 2020, 2:37 p.m. UTC | #4
Quoting Jason Gunthorpe (2020-06-24 15:25:44)
> On Wed, Jun 24, 2020 at 03:21:49PM +0100, Chris Wilson wrote:
> > Quoting Jason Gunthorpe (2020-06-24 15:16:04)
> > > On Wed, Jun 24, 2020 at 03:12:42PM +0100, Chris Wilson wrote:
> > > > Quoting Jason Gunthorpe (2020-06-24 13:39:10)
> > > > > On Wed, Jun 24, 2020 at 01:21:03PM +0100, Chris Wilson wrote:
> > > > > > Quoting Jason Gunthorpe (2020-06-24 13:10:53)
> > > > > > > On Wed, Jun 24, 2020 at 09:02:47AM +0100, Chris Wilson wrote:
> > > > > > > > When direct reclaim enters the shrinker and tries to reclaim pages, it
> > > > > > > > has to opportunitically unmap them [try_to_unmap_one]. For direct
> > > > > > > > reclaim, the calling context is unknown and may include attempts to
> > > > > > > > unmap one page of a dma object while attempting to allocate more pages
> > > > > > > > for that object. Pass the information along that we are inside an
> > > > > > > > opportunistic unmap that can allow that page to remain referenced and
> > > > > > > > mapped, and let the callback opt in to avoiding a recursive wait.
> > > > > > > 
> > > > > > > i915 should already not be holding locks shared with the notifiers
> > > > > > > across allocations that can trigger reclaim. This is already required
> > > > > > > to use notifiers correctly anyhow - why do we need something in the
> > > > > > > notifiers?
> > > > > > 
> > > > > > for (n = 0; n < num_pages; n++)
> > > > > >       pin_user_page()
> > > > > > 
> > > > > > may call try_to_unmap_page from the lru shrinker for [0, n-1].
> > > > > 
> > > > > Yes, of course you can't hold any locks that intersect with notifiers
> > > > > across pin_user_page()/get_user_page()
> > > > 
> > > > What lock though? It's just the page refcount, shrinker asks us to drop
> > > > it [via mmu], we reply we would like to keep using that page as freeing
> > > > it for the current allocation is "robbing Peter to pay Paul".
> > > 
> > > Maybe I'm unclear what this series is actually trying to fix? 
> > > 
> > > You said "avoiding a recursive wait" which sounds like some locking
> > > deadlock to me.
> > 
> > It's the shrinker being called while we are allocating for/on behalf of
> > the object. As we are actively using the object, we don't want to free
> > it -- the partial object allocation being the clearest, if the object
> > consists of 2 pages, trying to free page 0 in order to allocate page 1
> > has to fail (and the shrinker should find another candidate to reclaim,
> > or fail the allocation).
> 
> mmu notifiers are not for influencing policy of the mm.

It's policy is "this may fail" regardless of the mmu notifier at this
point. That is not changed.

Your suggestion is that we move the pages to the unevictable mapping so
that the shrinker LRU is never invoked on pages we have grabbed with
pin_user_page. Does that work with the rest of the mmu notifiers?
-Chris
Chris Wilson June 24, 2020, 5:58 p.m. UTC | #5
Quoting Jason Gunthorpe (2020-06-24 17:50:57)
> On Wed, Jun 24, 2020 at 03:37:32PM +0100, Chris Wilson wrote:
> > Quoting Jason Gunthorpe (2020-06-24 15:25:44)
> > > On Wed, Jun 24, 2020 at 03:21:49PM +0100, Chris Wilson wrote:
> > > > Quoting Jason Gunthorpe (2020-06-24 15:16:04)
> > > > > On Wed, Jun 24, 2020 at 03:12:42PM +0100, Chris Wilson wrote:
> > > > > > Quoting Jason Gunthorpe (2020-06-24 13:39:10)
> > > > > > > On Wed, Jun 24, 2020 at 01:21:03PM +0100, Chris Wilson wrote:
> > > > > > > > Quoting Jason Gunthorpe (2020-06-24 13:10:53)
> > > > > > > > > On Wed, Jun 24, 2020 at 09:02:47AM +0100, Chris Wilson wrote:
> > > > > > > > > > When direct reclaim enters the shrinker and tries to reclaim pages, it
> > > > > > > > > > has to opportunitically unmap them [try_to_unmap_one]. For direct
> > > > > > > > > > reclaim, the calling context is unknown and may include attempts to
> > > > > > > > > > unmap one page of a dma object while attempting to allocate more pages
> > > > > > > > > > for that object. Pass the information along that we are inside an
> > > > > > > > > > opportunistic unmap that can allow that page to remain referenced and
> > > > > > > > > > mapped, and let the callback opt in to avoiding a recursive wait.
> > > > > > > > > 
> > > > > > > > > i915 should already not be holding locks shared with the notifiers
> > > > > > > > > across allocations that can trigger reclaim. This is already required
> > > > > > > > > to use notifiers correctly anyhow - why do we need something in the
> > > > > > > > > notifiers?
> > > > > > > > 
> > > > > > > > for (n = 0; n < num_pages; n++)
> > > > > > > >       pin_user_page()
> > > > > > > > 
> > > > > > > > may call try_to_unmap_page from the lru shrinker for [0, n-1].
> > > > > > > 
> > > > > > > Yes, of course you can't hold any locks that intersect with notifiers
> > > > > > > across pin_user_page()/get_user_page()
> > > > > > 
> > > > > > What lock though? It's just the page refcount, shrinker asks us to drop
> > > > > > it [via mmu], we reply we would like to keep using that page as freeing
> > > > > > it for the current allocation is "robbing Peter to pay Paul".
> > > > > 
> > > > > Maybe I'm unclear what this series is actually trying to fix? 
> > > > > 
> > > > > You said "avoiding a recursive wait" which sounds like some locking
> > > > > deadlock to me.
> > > > 
> > > > It's the shrinker being called while we are allocating for/on behalf of
> > > > the object. As we are actively using the object, we don't want to free
> > > > it -- the partial object allocation being the clearest, if the object
> > > > consists of 2 pages, trying to free page 0 in order to allocate page 1
> > > > has to fail (and the shrinker should find another candidate to reclaim,
> > > > or fail the allocation).
> > > 
> > > mmu notifiers are not for influencing policy of the mm.
> > 
> > It's policy is "this may fail" regardless of the mmu notifier at this
> > point. That is not changed.
> 
> MMU notifiers are for tracking updates, they are not allowed to fail.
> The one slightly weird case of non-blocking is the only exception.
> 
> > Your suggestion is that we move the pages to the unevictable mapping so
> > that the shrinker LRU is never invoked on pages we have grabbed with
> > pin_user_page. Does that work with the rest of the mmu notifiers?
> 
> That is beyond what I'm familiar with - but generally - if you want to
> influence decisions the MM is making then it needs to be at the
> front of the process and not inside notifiers. 
> 
> So what you describe seems broadly appropriate to me.

Sadly, it's a mlock_vma_page problem all over again.
 
> I'm still a little unclear on what you are trying to fix - pinned
> pages are definitely not freed, do you have some case where pages
> which are pinned are being cleaned out from the MM despite being
> pinned? Sounds a bit strange, maybe that is worth adressing directly?

It suffices to say that pin_user_pages does not prevent try_to_unmap_one
from trying to revoke the page. But we could perhaps slip a
page_maybe_dma_pinned() in around there and see what happens.
-Chris

Patch
diff mbox series

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index fc68f3570e19..ee1ad008951c 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -48,7 +48,8 @@  enum mmu_notifier_event {
 	MMU_NOTIFY_RELEASE,
 };
 
-#define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
+#define MMU_NOTIFIER_RANGE_BLOCKABLE	BIT(0)
+#define MMU_NOTIFIER_RANGE_MAYFAIL	BIT(1)
 
 struct mmu_notifier_ops {
 	/*
@@ -169,6 +170,12 @@  struct mmu_notifier_ops {
 	 * a non-blocking behavior then the same applies to
 	 * invalidate_range_end.
 	 *
+	 * If mayfail is set then the callback may return -EAGAIN while still
+	 * holding its page references. This flag is set inside direct
+	 * reclaim paths that are opportunistically trying to unmap pages
+	 * from unknown contexts. The callback must be prepared to handle
+	 * the matching invalidate_range_end even after failing the
+	 * invalidate_range_start.
 	 */
 	int (*invalidate_range_start)(struct mmu_notifier *subscription,
 				      const struct mmu_notifier_range *range);
@@ -397,6 +404,12 @@  mmu_notifier_range_blockable(const struct mmu_notifier_range *range)
 	return (range->flags & MMU_NOTIFIER_RANGE_BLOCKABLE);
 }
 
+static inline bool
+mmu_notifier_range_mayfail(const struct mmu_notifier_range *range)
+{
+	return (range->flags & MMU_NOTIFIER_RANGE_MAYFAIL);
+}
+
 static inline void mmu_notifier_release(struct mm_struct *mm)
 {
 	if (mm_has_notifiers(mm))
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 352bb9f3ecc0..95b89cee7af4 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -493,6 +493,9 @@  static int mn_hlist_invalidate_range_start(
 			_ret = ops->invalidate_range_start(subscription, range);
 			if (!mmu_notifier_range_blockable(range))
 				non_block_end();
+			if (_ret == -EAGAIN &&
+			    mmu_notifier_range_mayfail(range))
+				_ret = 0;
 			if (_ret) {
 				pr_info("%pS callback failed with %d in %sblockable context.\n",
 					ops->invalidate_range_start, _ret,
diff --git a/mm/rmap.c b/mm/rmap.c
index 5fe2dedce1fc..912b737a3353 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1406,8 +1406,9 @@  static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	 * Note that the page can not be free in this function as call of
 	 * try_to_unmap() must hold a reference on the page.
 	 */
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
-				address,
+	mmu_notifier_range_init(&range,
+				MMU_NOTIFY_CLEAR, MMU_NOTIFIER_RANGE_MAYFAIL,
+				vma, vma->vm_mm, address,
 				min(vma->vm_end, address + page_size(page)));
 	if (PageHuge(page)) {
 		/*