[v5,0/9] mmu notifier provide context informations

Message ID 20190219200430.11130-1-jglisse@redhat.com (mailing list archive)

Message

Jerome Glisse Feb. 19, 2019, 8:04 p.m. UTC
From: Jérôme Glisse <jglisse@redhat.com>

Since the last version [4] I added the extra bits needed for the
change_pte optimization (which is a KSM thing). I am not posting the
users of this here; they will be posted to the appropriate sub-systems
(KVM, GPU, RDMA, ...) once this series gets upstream. If you want to
look at the users of this, see [5] [6]. If this gets into 5.1 then I
will be submitting those users for 5.2 (including KVM if the KVM folks
feel comfortable with it).

Note that this series does not change any behavior for any existing
code. It just passes down more information to mmu notifier listeners.

The rationale for this patchset:


CPU page table updates can happen for many reasons, not only as a
result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...)
but also as a result of kernel activities (memory compression, reclaim,
migration, ...).

This patchset introduces a set of enums that can be associated with
each of the events triggering a mmu notifier:

    - UNMAP: munmap() or mremap()
    - CLEAR: page table is cleared (migration, compaction, reclaim, ...)
    - PROTECTION_VMA: change in access protections for the range
    - PROTECTION_PAGE: change in access protections for a page in the range
    - SOFT_DIRTY: soft dirtiness tracking

Being able to distinguish munmap() and mremap() from the other reasons
why the page table is cleared is important because it allows users of
mmu notifier to update their own internal tracking structures
accordingly (on munmap or mremap it is no longer necessary to track the
range of virtual addresses, as it becomes invalid). Without this series,
drivers are forced to assume that every notification is a munmap, which
triggers useless thrashing within drivers that associate a structure
with a range of virtual addresses. Each driver is forced to free its
tracking structure and then restore it on the next device page fault.
With this series we can also optimize device page table updates [5].
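
As a rough illustration only, the distinction between UNMAP and the
other events could be used by a listener along the following lines. This
is a userspace sketch, not code from the patches: the enum and struct
names, the fields, and the driver_invalidate() helper are all
hypothetical, and it simplifies by treating every event as invalidating
the device mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical event names modelled on the list above; the identifiers
 * used by the actual patches may differ.
 */
enum notifier_event {
	EVENT_UNMAP,
	EVENT_CLEAR,
	EVENT_PROTECTION_VMA,
	EVENT_PROTECTION_PAGE,
	EVENT_SOFT_DIRTY,
};

/* Hypothetical stand-in for the range description given to a listener. */
struct notifier_range {
	unsigned long start;
	unsigned long end;
	enum notifier_event event;
};

/* Hypothetical per-range driver state. */
struct driver_tracking {
	bool mapped;	/* device page table currently maps the range */
	bool tracked;	/* driver still keeps a structure for the range */
};

static void driver_invalidate(struct driver_tracking *t,
			      const struct notifier_range *range)
{
	assert(range->start < range->end);

	/* Simplification: every event drops the device mapping. */
	t->mapped = false;

	/*
	 * Only an unmap makes the virtual address range itself invalid,
	 * so only then is the tracking structure released.
	 */
	if (range->event == EVENT_UNMAP)
		t->tracked = false;
}
```

With such a handler, a CLEAR caused by migration or reclaim would leave
the tracking structure in place, so the next device page fault would
only need to refill the mapping instead of rebuilding the structure.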

Moreover, this can also be used to optimize out some page table
updates, as for KVM, where we can update the secondary MMU directly
from the callback instead of clearing it.
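
A minimal sketch of the flag-based idea, under stated assumptions: the
patch titles mention converting "blockable" to a flags field and a
MMU_NOTIFIER_USE_CHANGE_PTE flag, but the macro names, bit values,
struct, and helpers below are hypothetical and only show how a listener
could test such bits.

```c
#include <stdbool.h>

/* Hypothetical flag bits; the names echo the patch titles, the values
 * are assumed. */
#define RANGE_BLOCKABLE		(1u << 0)
#define RANGE_USE_CHANGE_PTE	(1u << 1)

/* Hypothetical stand-in for the range description given to a listener. */
struct range_info {
	unsigned int flags;
};

/* Helper in the spirit of "test if a range invalidation is blockable". */
static bool range_blockable(const struct range_info *r)
{
	return (r->flags & RANGE_BLOCKABLE) != 0;
}

/*
 * Returns true when the listener has to clear its secondary mapping in
 * the invalidate callback, and false when a change_pte style callback
 * is expected to update the entry in place.
 */
static bool must_clear_secondary(const struct range_info *r)
{
	return (r->flags & RANGE_USE_CHANGE_PTE) == 0;
}
```

Keeping each property as one bit means a listener that does not care
about a given flag can simply ignore it.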

Patches to leverage this series will be posted separately to each
sub-system.

Cheers,
Jérôme

[1] v1 https://lkml.org/lkml/2018/3/23/1049
[2] v2 https://lkml.org/lkml/2018/12/5/10
[3] v3 https://lkml.org/lkml/2018/12/13/620
[4] v4 https://lkml.org/lkml/2019/1/23/838
[5] patches to use this:
    https://lkml.org/lkml/2019/1/23/833
    https://lkml.org/lkml/2019/1/23/834
    https://lkml.org/lkml/2019/1/23/832
    https://lkml.org/lkml/2019/1/23/831
[6] KVM restore change pte optimization
    https://patchwork.kernel.org/cover/10791179/

Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: kvm@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linux-rdma@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: Arnd Bergmann <arnd@arndb.de>

Jérôme Glisse (9):
  mm/mmu_notifier: helper to test if a range invalidation is blockable
  mm/mmu_notifier: convert user range->blockable to helper function
  mm/mmu_notifier: convert mmu_notifier_range->blockable to a flags
  mm/mmu_notifier: contextual information for event enums
  mm/mmu_notifier: contextual information for event triggering
    invalidation v2
  mm/mmu_notifier: use correct mmu_notifier events for each invalidation
  mm/mmu_notifier: pass down vma and reasons why mmu notifier is
    happening v2
  mm/mmu_notifier: mmu_notifier_range_update_to_read_only() helper
  mm/mmu_notifier: set MMU_NOTIFIER_USE_CHANGE_PTE flag where
    appropriate v2

 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  |  8 +--
 drivers/gpu/drm/i915/i915_gem_userptr.c |  2 +-
 drivers/gpu/drm/radeon/radeon_mn.c      |  4 +-
 drivers/infiniband/core/umem_odp.c      |  5 +-
 drivers/xen/gntdev.c                    |  6 +-
 fs/proc/task_mmu.c                      |  3 +-
 include/linux/mmu_notifier.h            | 93 +++++++++++++++++++++++--
 kernel/events/uprobes.c                 |  3 +-
 mm/hmm.c                                |  6 +-
 mm/huge_memory.c                        | 14 ++--
 mm/hugetlb.c                            | 12 ++--
 mm/khugepaged.c                         |  3 +-
 mm/ksm.c                                |  9 ++-
 mm/madvise.c                            |  3 +-
 mm/memory.c                             | 26 ++++---
 mm/migrate.c                            |  5 +-
 mm/mmu_notifier.c                       | 12 +++-
 mm/mprotect.c                           |  4 +-
 mm/mremap.c                             |  3 +-
 mm/oom_kill.c                           |  3 +-
 mm/rmap.c                               |  6 +-
 virt/kvm/kvm_main.c                     |  3 +-
 22 files changed, 180 insertions(+), 53 deletions(-)

Comments

Dan Williams Feb. 19, 2019, 8:15 p.m. UTC | #1
On Tue, Feb 19, 2019 at 12:04 PM <jglisse@redhat.com> wrote:
>
> From: Jérôme Glisse <jglisse@redhat.com>
>
> Since last version [4] i added the extra bits needed for the change_pte
> optimization (which is a KSM thing). Here i am not posting users of
> this, they will be posted to the appropriate sub-systems (KVM, GPU,
> RDMA, ...) once this serie get upstream. If you want to look at users
> of this see [5] [6]. If this gets in 5.1 then i will be submitting
> those users for 5.2 (including KVM if KVM folks feel comfortable with
> it).

The users look small and straightforward. Why not await acks and
reviewed-by's for the users like a typical upstream submission and
merge them together? Is all of the functionality of this
infrastructure consumed by the proposed users? Last time I checked it
was only a subset.
Jerome Glisse Feb. 19, 2019, 8:30 p.m. UTC | #2
On Tue, Feb 19, 2019 at 12:15:55PM -0800, Dan Williams wrote:
> On Tue, Feb 19, 2019 at 12:04 PM <jglisse@redhat.com> wrote:
> >
> > From: Jérôme Glisse <jglisse@redhat.com>
> >
> > Since last version [4] i added the extra bits needed for the change_pte
> > optimization (which is a KSM thing). Here i am not posting users of
> > this, they will be posted to the appropriate sub-systems (KVM, GPU,
> > RDMA, ...) once this serie get upstream. If you want to look at users
> > of this see [5] [6]. If this gets in 5.1 then i will be submitting
> > those users for 5.2 (including KVM if KVM folks feel comfortable with
> > it).
> 
> The users look small and straightforward. Why not await acks and
> reviewed-by's for the users like a typical upstream submission and
> merge them together? Is all of the functionality of this
> infrastructure consumed by the proposed users? Last time I checked it
> was only a subset.

Yes, pretty much all of it is used; the unused cases are SOFT_DIRTY and
CLEAR vs UNMAP, both of which I intend to use. The RDMA folks already
acked the patches IIRC, and so did radeon and amdgpu. I believe the
i915 folks were OK with it too. I do not want to merge all of this
through Andrew; we discussed that in the past: merge the mm bits through
Andrew in one release and the bits that use them in the next release.

Cheers,
Jérôme
Jason Gunthorpe Feb. 19, 2019, 8:40 p.m. UTC | #3
On Tue, Feb 19, 2019 at 03:30:33PM -0500, Jerome Glisse wrote:
> On Tue, Feb 19, 2019 at 12:15:55PM -0800, Dan Williams wrote:
> > On Tue, Feb 19, 2019 at 12:04 PM <jglisse@redhat.com> wrote:
> > >
> > > From: Jérôme Glisse <jglisse@redhat.com>
> > >
> > > Since last version [4] i added the extra bits needed for the change_pte
> > > optimization (which is a KSM thing). Here i am not posting users of
> > > this, they will be posted to the appropriate sub-systems (KVM, GPU,
> > > RDMA, ...) once this serie get upstream. If you want to look at users
> > > of this see [5] [6]. If this gets in 5.1 then i will be submitting
> > > those users for 5.2 (including KVM if KVM folks feel comfortable with
> > > it).
> > 
> > The users look small and straightforward. Why not await acks and
> > reviewed-by's for the users like a typical upstream submission and
> > merge them together? Is all of the functionality of this
> > infrastructure consumed by the proposed users? Last time I checked it
> > was only a subset.
> 
> Yes pretty much all is use, the unuse case is SOFT_DIRTY and CLEAR
> vs UNMAP. Both of which i intend to use. The RDMA folks already ack
> the patches IIRC, so did radeon and amdgpu. I believe the i915 folks
> were ok with it too. I do not want to merge things through Andrew
> for all of this we discussed that in the past, merge mm bits through
> Andrew in one release and bits that use things in the next release.

It is usually cleaner for everyone to split patches like this, for
instance I always prefer to merge RDMA patches via RDMA when
possible. Fewer conflicts.

The other somewhat reasonable option is to get acks and send your own
complete PR to Linus next week? That works OK for tree-wide changes.

Jason
Dan Williams Feb. 19, 2019, 8:40 p.m. UTC | #4
On Tue, Feb 19, 2019 at 12:30 PM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Tue, Feb 19, 2019 at 12:15:55PM -0800, Dan Williams wrote:
> > On Tue, Feb 19, 2019 at 12:04 PM <jglisse@redhat.com> wrote:
> > >
> > > From: Jérôme Glisse <jglisse@redhat.com>
> > >
> > > Since last version [4] i added the extra bits needed for the change_pte
> > > optimization (which is a KSM thing). Here i am not posting users of
> > > this, they will be posted to the appropriate sub-systems (KVM, GPU,
> > > RDMA, ...) once this serie get upstream. If you want to look at users
> > > of this see [5] [6]. If this gets in 5.1 then i will be submitting
> > > those users for 5.2 (including KVM if KVM folks feel comfortable with
> > > it).
> >
> > The users look small and straightforward. Why not await acks and
> > reviewed-by's for the users like a typical upstream submission and
> > merge them together? Is all of the functionality of this
> > infrastructure consumed by the proposed users? Last time I checked it
> > was only a subset.
>
> Yes pretty much all is use, the unuse case is SOFT_DIRTY and CLEAR
> vs UNMAP. Both of which i intend to use. The RDMA folks already ack
> the patches IIRC, so did radeon and amdgpu. I believe the i915 folks
> were ok with it too. I do not want to merge things through Andrew
> for all of this we discussed that in the past, merge mm bits through
> Andrew in one release and bits that use things in the next release.

Ok, I was trying to find the links to the acks on the mailing list,
those references would address my concerns. I see no reason to rush
SOFT_DIRTY and CLEAR ahead of the upstream user.
Dan Williams Feb. 19, 2019, 8:49 p.m. UTC | #5
On Tue, Feb 19, 2019 at 12:41 PM Jason Gunthorpe <jgg@mellanox.com> wrote:
>
> On Tue, Feb 19, 2019 at 03:30:33PM -0500, Jerome Glisse wrote:
> > On Tue, Feb 19, 2019 at 12:15:55PM -0800, Dan Williams wrote:
> > > On Tue, Feb 19, 2019 at 12:04 PM <jglisse@redhat.com> wrote:
> > > >
> > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > >
> > > > Since last version [4] i added the extra bits needed for the change_pte
> > > > optimization (which is a KSM thing). Here i am not posting users of
> > > > this, they will be posted to the appropriate sub-systems (KVM, GPU,
> > > > RDMA, ...) once this serie get upstream. If you want to look at users
> > > > of this see [5] [6]. If this gets in 5.1 then i will be submitting
> > > > those users for 5.2 (including KVM if KVM folks feel comfortable with
> > > > it).
> > >
> > > The users look small and straightforward. Why not await acks and
> > > reviewed-by's for the users like a typical upstream submission and
> > > merge them together? Is all of the functionality of this
> > > infrastructure consumed by the proposed users? Last time I checked it
> > > was only a subset.
> >
> > Yes pretty much all is use, the unuse case is SOFT_DIRTY and CLEAR
> > vs UNMAP. Both of which i intend to use. The RDMA folks already ack
> > the patches IIRC, so did radeon and amdgpu. I believe the i915 folks
> > were ok with it too. I do not want to merge things through Andrew
> > for all of this we discussed that in the past, merge mm bits through
> > Andrew in one release and bits that use things in the next release.
>
> It is usually cleaner for everyone to split patches like this, for
> instance I always prefer to merge RDMA patches via RDMA when
> possible. Less conflicts.
>
> The other somewhat reasonable option is to get acks and send your own
> complete PR to Linus next week? That works OK for tree-wide changes.

Yes, I'm not proposing that they be merged together, instead I'm just
looking for the acked-by / reviewed-by tags even if those patches are
targeting the next merge window.
Jerome Glisse Feb. 19, 2019, 8:57 p.m. UTC | #6
On Tue, Feb 19, 2019 at 12:40:37PM -0800, Dan Williams wrote:
> On Tue, Feb 19, 2019 at 12:30 PM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Tue, Feb 19, 2019 at 12:15:55PM -0800, Dan Williams wrote:
> > > On Tue, Feb 19, 2019 at 12:04 PM <jglisse@redhat.com> wrote:
> > > >
> > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > >
> > > > Since last version [4] i added the extra bits needed for the change_pte
> > > > optimization (which is a KSM thing). Here i am not posting users of
> > > > this, they will be posted to the appropriate sub-systems (KVM, GPU,
> > > > RDMA, ...) once this serie get upstream. If you want to look at users
> > > > of this see [5] [6]. If this gets in 5.1 then i will be submitting
> > > > those users for 5.2 (including KVM if KVM folks feel comfortable with
> > > > it).
> > >
> > > The users look small and straightforward. Why not await acks and
> > > reviewed-by's for the users like a typical upstream submission and
> > > merge them together? Is all of the functionality of this
> > > infrastructure consumed by the proposed users? Last time I checked it
> > > was only a subset.
> >
> > Yes pretty much all is use, the unuse case is SOFT_DIRTY and CLEAR
> > vs UNMAP. Both of which i intend to use. The RDMA folks already ack
> > the patches IIRC, so did radeon and amdgpu. I believe the i915 folks
> > were ok with it too. I do not want to merge things through Andrew
> > for all of this we discussed that in the past, merge mm bits through
> > Andrew in one release and bits that use things in the next release.
> 
> Ok, I was trying to find the links to the acks on the mailing list,
> those references would address my concerns. I see no reason to rush
> SOFT_DIRTY and CLEAR ahead of the upstream user.

I intend to post users for those in the next couple of weeks for the
5.2 HMM bits. So users of this (CLEAR/UNMAP/SOFT_DIRTY) will definitely
materialize in time for 5.2.

ACKS AMD/RADEON https://lkml.org/lkml/2019/2/1/395
ACKS RDMA https://lkml.org/lkml/2018/12/6/1473

For KVM, Andrea Arcangeli seems to like the whole idea of restoring the
change_pte optimization, but I have not gotten an ACK from Radim or
Paolo. However, given the small performance improvement figure I get
with it, I do not see why they would not ACK.

https://lkml.org/lkml/2019/2/18/1530

Cheers,
Jérôme
Dan Williams Feb. 19, 2019, 9:19 p.m. UTC | #7
On Tue, Feb 19, 2019 at 12:58 PM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Tue, Feb 19, 2019 at 12:40:37PM -0800, Dan Williams wrote:
> > On Tue, Feb 19, 2019 at 12:30 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > >
> > > On Tue, Feb 19, 2019 at 12:15:55PM -0800, Dan Williams wrote:
> > > > On Tue, Feb 19, 2019 at 12:04 PM <jglisse@redhat.com> wrote:
> > > > >
> > > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > >
> > > > > Since last version [4] i added the extra bits needed for the change_pte
> > > > > optimization (which is a KSM thing). Here i am not posting users of
> > > > > this, they will be posted to the appropriate sub-systems (KVM, GPU,
> > > > > RDMA, ...) once this serie get upstream. If you want to look at users
> > > > > of this see [5] [6]. If this gets in 5.1 then i will be submitting
> > > > > those users for 5.2 (including KVM if KVM folks feel comfortable with
> > > > > it).
> > > >
> > > > The users look small and straightforward. Why not await acks and
> > > > reviewed-by's for the users like a typical upstream submission and
> > > > merge them together? Is all of the functionality of this
> > > > infrastructure consumed by the proposed users? Last time I checked it
> > > > was only a subset.
> > >
> > > Yes pretty much all is use, the unuse case is SOFT_DIRTY and CLEAR
> > > vs UNMAP. Both of which i intend to use. The RDMA folks already ack
> > > the patches IIRC, so did radeon and amdgpu. I believe the i915 folks
> > > were ok with it too. I do not want to merge things through Andrew
> > > for all of this we discussed that in the past, merge mm bits through
> > > Andrew in one release and bits that use things in the next release.
> >
> > Ok, I was trying to find the links to the acks on the mailing list,
> > those references would address my concerns. I see no reason to rush
> > SOFT_DIRTY and CLEAR ahead of the upstream user.
>
> I intend to post user for those in next couple weeks for 5.2 HMM bits.
> So user for this (CLEAR/UNMAP/SOFTDIRTY) will definitly materialize in
> time for 5.2.
>
> ACKS AMD/RADEON https://lkml.org/lkml/2019/2/1/395
> ACKS RDMA https://lkml.org/lkml/2018/12/6/1473

Nice, thanks!

> For KVM Andrea Arcangeli seems to like the whole idea to restore the
> change_pte optimization but i have not got ACK from Radim or Paolo,
> however given the small performance improvement figure i get with it
> i do not see while they would not ACK.

Sure, but no need to push ahead without that confirmation, right? At
least for the piece that KVM cares about, maybe that's already covered
in the infrastructure RDMA and RADEON are using?
Jerome Glisse Feb. 19, 2019, 9:30 p.m. UTC | #8
On Tue, Feb 19, 2019 at 01:19:09PM -0800, Dan Williams wrote:
> On Tue, Feb 19, 2019 at 12:58 PM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Tue, Feb 19, 2019 at 12:40:37PM -0800, Dan Williams wrote:
> > > On Tue, Feb 19, 2019 at 12:30 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > > >
> > > > On Tue, Feb 19, 2019 at 12:15:55PM -0800, Dan Williams wrote:
> > > > > On Tue, Feb 19, 2019 at 12:04 PM <jglisse@redhat.com> wrote:
> > > > > >
> > > > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > > >
> > > > > > Since last version [4] i added the extra bits needed for the change_pte
> > > > > > optimization (which is a KSM thing). Here i am not posting users of
> > > > > > this, they will be posted to the appropriate sub-systems (KVM, GPU,
> > > > > > RDMA, ...) once this serie get upstream. If you want to look at users
> > > > > > of this see [5] [6]. If this gets in 5.1 then i will be submitting
> > > > > > those users for 5.2 (including KVM if KVM folks feel comfortable with
> > > > > > it).
> > > > >
> > > > > The users look small and straightforward. Why not await acks and
> > > > > reviewed-by's for the users like a typical upstream submission and
> > > > > merge them together? Is all of the functionality of this
> > > > > infrastructure consumed by the proposed users? Last time I checked it
> > > > > was only a subset.
> > > >
> > > > Yes pretty much all is use, the unuse case is SOFT_DIRTY and CLEAR
> > > > vs UNMAP. Both of which i intend to use. The RDMA folks already ack
> > > > the patches IIRC, so did radeon and amdgpu. I believe the i915 folks
> > > > were ok with it too. I do not want to merge things through Andrew
> > > > for all of this we discussed that in the past, merge mm bits through
> > > > Andrew in one release and bits that use things in the next release.
> > >
> > > Ok, I was trying to find the links to the acks on the mailing list,
> > > those references would address my concerns. I see no reason to rush
> > > SOFT_DIRTY and CLEAR ahead of the upstream user.
> >
> > I intend to post user for those in next couple weeks for 5.2 HMM bits.
> > So user for this (CLEAR/UNMAP/SOFTDIRTY) will definitly materialize in
> > time for 5.2.
> >
> > ACKS AMD/RADEON https://lkml.org/lkml/2019/2/1/395
> > ACKS RDMA https://lkml.org/lkml/2018/12/6/1473
> 
> Nice, thanks!
> 
> > For KVM Andrea Arcangeli seems to like the whole idea to restore the
> > change_pte optimization but i have not got ACK from Radim or Paolo,
> > however given the small performance improvement figure i get with it
> > i do not see while they would not ACK.
> 
> Sure, but no need to push ahead without that confirmation, right? At
> least for the piece that KVM cares about, maybe that's already covered
> in the infrastructure RDMA and RADEON are using?

The change_pte() for KVM is just one bit flag on top of the rest, so I
don't see much value in holding back this last patch. I will be working
with the KVM folks to merge the KVM bits in 5.2. If they do not want
that, then removing that extra flag is not much work.

But if you prefer, then Andrew can drop the last patch in the series.

Cheers,
Jérôme