[v1,for-next,06/16] IB/core: Implement support for MMU notifiers regarding on demand paging regions

From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= <jglisse@redhat.com>

On Thu, Sep 11, 2014 at 12:19:01PM +0000, Shachar Raindel wrote:
> 
> 
> > -----Original Message-----
> > From: Jerome Glisse [mailto:j.glisse@gmail.com]
> > Sent: Wednesday, September 10, 2014 11:15 PM
> > To: Shachar Raindel
> > Cc: Haggai Eran; linux-rdma@vger.kernel.org; Sagi Grimberg
> > Subject: Re: [PATCH v1 for-next 06/16] IB/core: Implement support for
> > MMU notifiers regarding on demand paging regions
> > 
> > On Wed, Sep 10, 2014 at 09:00:36AM +0000, Shachar Raindel wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jerome Glisse [mailto:j.glisse@gmail.com]
> > > > Sent: Tuesday, September 09, 2014 6:37 PM
> > > > To: Shachar Raindel
> > > > Cc: 1404377069-20585-1-git-send-email-haggaie@mellanox.com; Haggai
> > Eran;
> > > > linux-rdma@vger.kernel.org; Jerome Glisse; Sagi Grimberg
> > > > Subject: Re: [PATCH v1 for-next 06/16] IB/core: Implement support
> > for
> > > > MMU notifiers regarding on demand paging regions
> > > >
> > > > On Sun, Sep 07, 2014 at 02:35:59PM +0000, Shachar Raindel wrote:
> > > > > Hi,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jerome Glisse [mailto:j.glisse@gmail.com]
> > > > > > Sent: Thursday, September 04, 2014 11:25 PM
> > > > > > To: Haggai Eran; linux-rdma@vger.kernel.org
> > > > > > Cc: Shachar Raindel; Sagi Grimberg
> > > > > > Subject: Re: [PATCH v1 for-next 06/16] IB/core: Implement
> > support
> > > > for
> > > > > > MMU notifiers regarding on demand paging regions
> > > > > >
> 
> <SNIP>
> 
> > > > >
> > > > > Sadly, taking mmap_sem in read-only mode does not prevent all
> > possible
> > > > invalidations from happening.
> > > > > For example, a call to madvise requesting MADVISE_DONTNEED will
> > lock
> > > > the mmap_sem for reading only, allowing a notifier to run in
> > parallel to
> > > > the MR registration As a result, the following sequence of events
> > could
> > > > happen:
> > > > >
> > > > > Thread 1:                       |   Thread 2
> > > > > --------------------------------+-------------------------
> > > > > madvise                         |
> > > > > down_read(mmap_sem)             |
> > > > > notifier_start                  |
> > > > >                                 |   down_read(mmap_sem)
> > > > >                                 |   register_mr
> > > > > notifier_end                    |
> > > > > reduce_mr_notifiers_count       |
> > > > >
> > > > > The end result of this sequence is an mr with running notifiers
> > count
> > > > of -1, which is bad.
> > > > > The current workaround is to avoid decreasing the notifiers count
> > if
> > > > it is zero, which can cause other issues.
> > > > > The proper fix would be to prevent notifiers from running in
> > parallel
> > > > to registration. For this, taking mmap_sem in write mode might be
> > > > sufficient, but we are not sure about this.
> > > > > We will be happy to hear additional input on this subject, to make
> > > > sure we got it covered properly.
> > > >
> > > > So in HMM i solve this by having a struct allocated in the start
> > range
> > > > callback
> > > > and the end range callback just ignore things when it can not find
> > the
> > > > matching
> > > > struct.
> > >
> > > This kind of mechanism sounds like it has a bigger risk for
> > deadlocking
> > > the system, causing an OOM kill without a real need or significantly
> > > slowing down the system.
> > > If you are doing non-atomic memory allocations, you can deadlock the
> > > system by requesting memory in the swapper flow.
> > > Even if you are doing atomic memory allocations, you need to handle
> > the
> > > case of failing allocation, the solution to which is unclear to me.
> > > If you are using a pre-allocated pool, what are you doing when you run
> > > out of available entries in the pool? If you are blocking until some
> > > entries free up, what guarantees you that this will not cause a
> > deadlock?
> > 
> > So i am using a fixed pool and when it runs out it block in start
> > callback
> > until one is freed. 
> 
> This sounds scary. You now create a possible locking dependency between
> two code flows which could have run in parallel. This can cause circular
> locking bugs, from code which functioned properly until now. For example,
> assume code with a single lock, and the following code paths:
> 
> Code 1:
> notify_start()
> lock()
> unlock()
> notify_end()
> 
> Code 2:
> lock()
> notify_start()
> ... (no locking)
> notify_end()
> unlock()
> 

This can not happen because all lock taken before notify_start() are
never taken after it and all lock taken inside a start/end section
are never hold accross a notify_start() callback.

> 
> 
> This code can now create the following deadlock:
> 
> Thread 1:        | Thread 2:
> -----------------+-----------------------------------
> notify_start()   |
>                  | lock()
> lock() - blocking|
>                  | notify_start() - blocking for slot
> 
> 
> 
> 
> > But as i said i have a patch to use the stack that
> > will
> > solve this and avoid a pool.
> 
> How are you allocating from the stack an entry which you need to keep alive
> until another function is called? You can't allocate the entry on the
> notify_start stack, so you must do this in all of the call points to the
> mmu_notifiers. Given the notifiers listener subscription pattern, this seems
> like something which is not practical.

Yes the patch add a struct in each callsite of mmu_notifier_invalidate_range
as in all case both start and end are call from same function. The only draw
back is that it increase stack consumption in some of those callsite (not all).
I attach the patch i am thinking of (it is untested) but idea is that through
two new helper function user of mmu_notifier can query active invalid range and
synchronize with those (also require some code in the range_start() callback).

> 
>  
> > 
> > >
> > > >
> > > > That being said when registering the mmu_notifier you need 2 things,
> > > > first you
> > > > need a pin on the mm (either mm is current ie current->mm or you
> > took a
> > > > reference
> > > > on it). Second you need to that the mmap smemaphore in write mode so
> > > > that
> > > > no concurrent mmap/munmap/madvise can happen. By doing that you
> > protect
> > > > yourself
> > > > from concurrent range_start/range_end that can happen and that does
> > > > matter.
> > > > The only concurrent range_start/end that can happen is through file
> > > > invalidation
> > > > which is fine because subsequent page fault will go through the file
> > > > layer and
> > > > bring back page or return error (if file was truncated for
> > instance).
> > >
> > > Sadly, this is not sufficient for our use case. We are registering
> > > a single MMU notifier handler, and broadcast the notifications to
> > > all relevant listeners, which are stored in an interval tree.
> > >
> > > Each listener represents a section of the address space that has been
> > > exposed to the network. Such implementation allows us to limit the
> > impact
> > > of invalidations, and only block racing page faults to the affected
> > areas.
> > >
> > > Each of the listeners maintain a counter of the number of
> > invalidate_range
> > > notifications that are currently affecting it. The counter is
> > increased
> > > for each invalidate_range_start callback received, and decrease for
> > each
> > > invalidate_range_end callback received. If we add a listener to the
> > > interval tree after the invalidate_range_start callback happened, but
> > > before the invalidate_range_end callback happened, it will decrease
> > the
> > > counter, reaching negative numbers and breaking the logic.
> > >
> > > The mmu_notifiers registration code avoid such issues by taking all
> > > relevant locks on the MM. This effectively blocks all possible
> > notifiers
> > > from happening when registering a new notifier. Sadly, this function
> > is
> > > not exported for modules to use it.
> > >
> > > Our options at the moment are:
> > > - Use a tracking mechanism similar to what HMM uses, alongside the
> > >   challenges involved in allocating memory from notifiers
> > >
> > > - Use a per-process counter for invalidations, causing a possible
> > >   performance degradation. This can possibly be used as a fallback to
> > the
> > >   first option (i.e. have a pool of X notifier identifiers, once it is
> > >   full, increase/decrease a per-MM counter)
> > >
> > > - Export the mm_take_all_locks function for modules. This will allow
> > us
> > >   to lock the MM when adding a new listener.
> > 
> > I was not clear enough, you need to take the mmap_sem in write mode
> > accross
> > mmu_notifier_register(). This is only to partialy solve your issue that
> > if
> > a mmu_notifier is already register for the mm you are trying to
> > registering
> > against then there is a chance for you to be inside an active
> > range_start/
> > range_end section which would lead to invalid counter inside your
> > tracking
> > structure. But, sadly, taking mmap_sem in write mode is not enough as
> > file
> > invalidation might still happen concurrently so you will need to make
> > sure
> > you invalidation counters does not go negative but from page fault point
> > of
> > view you will be fine because the page fault will synchronize through
> > the
> > pagecache. So scenario (A and B are to anonymous overlapping address
> > range) :
> > 
> >   APP_TOTO_RDMA_THREAD           |  APP_TOTO_SOME_OTHER_THREAD
> >                                  |  mmu_notifier_invalidate_range_start(A)
> >   odp_register()                 |
> >     down_read(mmap_sem)          |
> >     mmu_notifier_register()      |
> >     up_read(mmap_sem)            |
> >   odp_add_new_region(B)          |
> >   odp_page_fault(B)              |
> >     down_read(mmap_sem)          |
> >     ...                          |
> >     up_read(mmap_sem)            |
> >                                  |  mmu_notifier_invalidate_range_end(A)
> > 
> > The odp_page_fault(B) might see invalid cpu page table but you have no
> > idea
> > about it because you registered after the range_start(). But if you take
> > the
> > mmap_sem in write mode then the only case where you might still have
> > this
> > scenario is if A and B are range of a file backed vma and that the file
> > is
> > undergoing some change (most likely truncation). But the file case is
> > fine
> > because the odp_page_fault() will go through the pagecache which is
> > properly
> > synchronize against the current range invalidation.
> 
> Specifically, if you call mmu_notifier_register you are OK and the above
> scenario will not happen. You are supposed to hold mmap_sem for writing,
> and mmu_notifier_register is calling mm_take_all_locks, which guarantees
> no racing notifier during the registration step.
> 
> However, we want to dynamically add sub-notifiers in our code. Each will
> get notified only about invalidations touching a specific sub-sections of
> the address space. To avoid providing unneeded notifications, we use an
> interval tree that filters only the needed notifications.
> When adding entries to the interval tree, we cannot lock the mm to prevent
> any racing invalidations. As such, we might end up in a case where a newly
> registered memory region will get a "notify_end" call without the relevant
> "notify_start". Even if we prevent the value from dropping below zero, it
> means we can cause data corruption. For example, if we have another
> notifier running after the MR registers, which is due to munmap, but we get
> first the notify_end of the previous notifier for which we didn't see the
> notify_start.
> 
> The solution we are coming up with now is using a global counter of running
> invalidations for new regions allocated. When the global counter is at zero,
> we can safely switch to the region local invalidations counter.

Yes i fully understood that design but as i said this kind of broken and this
is what the attached patch try to address as HMM have the same issue of having
to track all active invalidation range.

> 
> 
> > 
> > 
> > Now for the the general case outside of mmu_notifier_register() HMM also
> > track
> > active invalidation range to avoid page faulting into those range as we
> > can not
> > trust the cpu page table for as long as the range invalidation is on
> > going.
> > 
> > > >
> > > > So as long as you hold the mmap_sem in write mode you should not
> > worry
> > > > about
> > > > concurrent range_start/range_end (well they might happen but only
> > for
> > > > file
> > > > backed vma).
> > > >
> > >
> > > Sadly, the mmap_sem is not enough to protect us :(.
> > 
> > This is enough like i explain above, but i am only talking about the mmu
> > notifier registration. For the general case once you register you only
> > need to take the mmap_sem in read mode during page fault.
> > 
> 
> I think we are not broadcasting on the same wavelength here. The issue I'm
> worried about is of adding a sub-area to our tracking system. It is built
> quite differently from how HMM is built, we are defining areas to track
> a-priori, and later on account how many notifiers are blocking page-faults
> for each area. You are keeping track of the active notifiers, and check
> each page fault against your notifier list. This difference makes for
> different locking needs.
> 
> > > > Given that you face the same issue as i have with the
> > > > range_start/range_end i
> > > > will stich up a patch to make it easier to track those.
> > > >
> > >
> > > That would be nice, especially if we could easily integrate it into
> > our
> > > code and reduce the code size.
> > 
> > Yes it's a "small modification" to the mmu_notifier api, i have been
> > side
> > tracked on other thing. But i will have it soon.
> > 
> 
> Being side tracked is a well-known professional risk in our line of work ;)
> 
> 
> > >
> > > > Cheers,
> > > > Jérôme
> > > >
> > > >
From 037195e49fbed468d16b78f0364fe302bc732d12 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= <jglisse@redhat.com>
Date: Thu, 11 Sep 2014 11:22:12 -0400
Subject: [PATCH] mmu_notifier: keep track of active invalidation ranges
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The mmu_notifier_invalidate_range_start() and mmu_notifier_invalidate_range_end()
can be considered as forming an "atomic" section for the cpu page table update
point of view. Between this two function the cpu page table content is unreliable
for the affected range of address.

Current user such as kvm need to know when they can trust a the content of the
cpu page table. This becomes even more important to new users of the mmu_notifier
api (such as HMM or ODP).

This patch use a structure define at all call site to invalidate_range_start()
that is added to a list for the duration of the invalidation. It adds two new
helpers to allow querying if a range is being invalidated or to wait for a range
to become valid.

This two new function does not provide strong synchronization but are intended
to be use as helper. User of the mmu_notifier must also synchronize with themself
inside their range_start() and range_end() callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 drivers/gpu/drm/i915/i915_gem_userptr.c | 13 +++---
 drivers/iommu/amd_iommu_v2.c            |  8 +---
 drivers/misc/sgi-gru/grutlbpurge.c      | 15 +++----
 drivers/xen/gntdev.c                    |  8 ++--
 fs/proc/task_mmu.c                      | 12 +++--
 include/linux/mmu_notifier.h            | 55 ++++++++++++-----------
 mm/fremap.c                             |  8 +++-
 mm/huge_memory.c                        | 78 ++++++++++++++-------------------
 mm/hugetlb.c                            | 49 +++++++++++----------
 mm/memory.c                             | 73 ++++++++++++++++--------------
 mm/migrate.c                            | 16 +++----
 mm/mmu_notifier.c                       | 73 +++++++++++++++++++++++++-----
 mm/mprotect.c                           | 17 ++++---
 mm/mremap.c                             | 14 +++---
 mm/rmap.c                               | 15 +++----
 virt/kvm/kvm_main.c                     | 10 ++---
 16 files changed, 256 insertions(+), 208 deletions(-)

[v1,for-next,06/16] IB/core: Implement support for MMU notifiers regarding on demand paging regions

Commit Message

Comments

Patch