From patchwork Mon Feb 24 20:30:33 2020
X-Patchwork-Submitter: Michel Lespinasse
X-Patchwork-Id: 11401483
Date: Mon, 24 Feb 2020 12:30:33 -0800
Message-Id: <20200224203057.162467-1-walken@google.com>
Subject: [RFC PATCH 00/24] Fine grained MM locking
From: Michel Lespinasse
To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
 Matthew Wilcox, "Liam R . Howlett", Jerome Glisse, Davidlohr Bueso,
 David Rientjes
Cc: linux-mm, Michel Lespinasse

Hi,

This is the first version of my work towards fine grained MM locking.
This is still early work - I am happy with my page fault changes, but
want to expand on the mmap/munmap side of things before I send the
next version.

I have previously shared this with some of the copied folks (for those
who received that, there are no additional changes in this public
resend). Please expect a v2 within a few weeks, with further changes
for fine grained range locking in the mmap and munmap paths.

This work originated in discussions at LSF/MM 2019; it is intended to
address the latency issues that are caused by false conflicts between
threads working on separate parts of their address space. The
priorities are to keep things as simple as possible, and to allow for
progressive conversion of the code base to finer grained MM locks.

The general approach is to replace the mmap_sem rwsem with a range
lock. Initially all lock/unlock sites are automatically converted to
lock the entire address space through a new API. Then, the API is
extended to support range locking. Locking sites can then be
progressively converted to use range locking, while leaving
unconverted sites working with no code changes.

When using a range lock (as opposed to a coarse lock), the following
rules apply:

- Some structures (notably the vma rbtree and associated statistics)
  are per-mm. They need to be locked separately using a new
  mm_vma_lock. The entire point of this patch set is to reduce false
  sharing latencies, so the mm_vma_lock must be held only for short
  times. We expect to do O(log N) operations holding the lock (for
  example, walking or updating the vma rbtree) but no O(N) operations
  (such as iterating on all vmas within a range or all mapped pages
  within a range).

- Code holding the mm_vma_lock should only update vma attributes for
  the range it has a write lock for. However, range locks only protect
  the vma's attributes, not the vmas themselves - vmas can still be
  split or merged with their neighbors if they have compatible
  attributes.

- Code holding a range lock but not the mm_vma_lock must be prepared
  for the vmas at both ends of the locked range to be merged with
  their neighbors outside of the locked range.
  The easiest way to do that is to copy the vma of record into a
  pseudo-vma before releasing the mm_vma_lock (this is a bit kludgy
  and I would prefer to copy only the necessary VMA attributes, but
  using a pseudo-vma makes it easier to maintain this patchset out of
  mainline for the moment).

Call sites that take a range lock usually take the mm_vma_lock
immediately afterwards - it would probably be more efficient to
collapse mm_vma_lock with the mutex that protects the range lock
structures. This isn't done yet as I tried to simplify the initial
implementation.

In the future I would also like to remove the various workarounds we
have been doing to limit mmap_sem hold times (e.g.
FAULT_FLAG_ALLOW_RETRY, vm_populate and munmap downgrading to a read
lock, ...) which shouldn't be necessary if the locking were only
effective on the memory ranges affected by each operation.

The included changes apply on top of upstream kernel v5.5. Please
apply with git am -p0 - I'm not sure why my git format-patch setup
requires that.

Commits 1 to 6 implement a range locking API:
- 1 implements coarse locking as wrappers around rwsem;
- 2 converts most mmap_sem locking sites to use the new coarse locking
  API (using coccinelle to automate the conversion);
- 3 converts remaining mmap_sem locking sites which were missed by
  coccinelle;
- 4 extends the API to support range locking. The initial
  implementation still uses coarse locking (ignoring the range), but
  it validates that the callers use matching ranges in lock and unlock
  calls;
- 5 prepares callers to allow for sleeping during unlock;
- 6 actually implements the range locking functions.

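To make the shape of commits 1 and 4 easier to picture, here is a
minimal userspace sketch. It is not code from the series: every name
in it (range_sketch, mm_lock_sketch, the sketch_* functions) is
hypothetical, and the "lock" is a single-threaded model that only
records state. It shows a coarse API expressed as a wrapper around a
range API, and an unlock that checks the caller passes the same range
it locked with.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical range type: covers addresses in [start, end). */
struct range_sketch {
	unsigned long start;
	unsigned long end;
};

/*
 * Hypothetical stand-in for the per-mm lock. It models a single writer
 * in a single thread; it is not a real lock.
 */
struct mm_lock_sketch {
	bool write_locked;
	struct range_sketch held;	/* range recorded at lock time */
};

/* "Lock everything" range used by the coarse wrappers below. */
static const struct range_sketch sketch_full_range = { 0, ULONG_MAX };

static void sketch_write_range_lock(struct mm_lock_sketch *lock,
				    struct range_sketch range)
{
	assert(range.start < range.end);
	assert(!lock->write_locked);	/* a real lock would wait here */
	lock->write_locked = true;
	lock->held = range;
}

/* Returns false when the unlock range does not match the lock range. */
static bool sketch_write_range_unlock(struct mm_lock_sketch *lock,
				      struct range_sketch range)
{
	if (!lock->write_locked ||
	    lock->held.start != range.start || lock->held.end != range.end)
		return false;
	lock->write_locked = false;
	return true;
}

/* Coarse API: unconverted call sites keep locking the whole space. */
static void sketch_write_lock(struct mm_lock_sketch *lock)
{
	sketch_write_range_lock(lock, sketch_full_range);
}

static bool sketch_write_unlock(struct mm_lock_sketch *lock)
{
	return sketch_write_range_unlock(lock, sketch_full_range);
}
```

In this model an unconverted site calls sketch_write_lock() and
sketch_write_unlock() and never sees a range, while a converted site
passes an explicit range to both calls; a mismatched unlock is
reported instead of silently releasing the lock.
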
Commits 7 to 12 allow the x86 fault handler to specify a range that
may be released while handling the fault:
- 7 adds a range field to struct vm_fault;
- 8 makes handle_mm_fault() populate that field;
- 9 and 10 honor it when dropping mmap_sem during fault handling;
- 11 is a cleanup to the x86 fault handler to prepare for 12;
- 12 changes the x86 fault handler to use an explicit lock range.

Commits 13 to 15 prepare for operating on a pseudo-vma during faults:
- 13 adds a prepare_mm_fault() function which may update the vma of
  record (specifically, allocate an anon_vma) before creating the
  pseudo-vma;
- 14 disables swap vma readahead as its implementation keeps stats in
  the vma;
- 15 changes the x86 fault handler to use pseudo-vmas when handling
  anon vmas.

Commits 16 and 17 implement range locking in x86 anonymous vma faults:
- 16 adds the vma locking API to be used to manipulate vmas when
  holding a fine grained range lock;
- 17 converts the x86 fault handler to use a pmd sized range lock when
  operating on anon vmas.

Commits 18 to 20 extend the above to also work on filemap based files:
- 18 makes sure we release the correct range when dropping mmap_sem
  during filemap file access;
- 19 tags vm_operations that support range locking;
- 20 makes the x86 fault handler use fine grained ranges when faulting
  the supported files.

Commits 21 to 24 implement range locking for the most basic mmap()
case:
- 21 adds a locked argument to do_mmap();
- 22 makes do_mmap acquire the mmap_sem if locked is false;
- 23 converts some easy call sites to pass locked=false;
- 24 changes do_mmap to acquire a fine grained lock in the easiest
  case (anonymous mapping, known address, no prior existing mapping).

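The locked argument from commits 21 to 24 can be pictured with the
following userspace sketch. Again, none of this is code from the
series: the struct and function names are hypothetical, the "lock" is
only a counter, and the assumption that the fine grained range is
exactly [addr, addr + len) is mine. The point is only the control
flow: a caller that already holds the lock passes locked=true, and
otherwise the function takes the lock itself, choosing a narrow range
in the easy case and the whole address space everywhere else.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical range type: covers addresses in [start, end). */
struct range_sketch {
	unsigned long start;
	unsigned long end;
};

/* Hypothetical summary of what the caller is asking to map. */
struct mmap_request_sketch {
	unsigned long addr;
	unsigned long len;
	bool anonymous;		/* anonymous mapping */
	bool known_addr;	/* caller supplied the address */
	bool overlaps_existing;	/* a mapping already exists there */
};

/* Hypothetical mm: lock_count stands in for the real lock. */
struct mm_sketch {
	int lock_count;
	struct range_sketch last_range;	/* last range taken internally */
};

/* Fine grained range only in the easiest case, coarse otherwise. */
static struct range_sketch
sketch_choose_lock_range(const struct mmap_request_sketch *req)
{
	struct range_sketch coarse = { 0, ULONG_MAX };

	if (req->anonymous && req->known_addr && !req->overlaps_existing) {
		struct range_sketch fine = { req->addr, req->addr + req->len };
		return fine;
	}
	return coarse;
}

static unsigned long sketch_do_mmap(struct mm_sketch *mm,
				    const struct mmap_request_sketch *req,
				    bool locked)
{
	if (!locked) {
		/* Caller does not hold the lock: take it here. */
		mm->last_range = sketch_choose_lock_range(req);
		mm->lock_count++;
	}

	/* ... the actual mapping work would happen here ... */

	if (!locked)
		mm->lock_count--;
	return req->addr;
}
```

Existing callers that hold the coarse lock keep working by passing
locked=true, which is what lets call sites be converted one at a time.
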
Michel Lespinasse (24):
  MM locking API: initial implementation as rwsem wrappers
  MM locking API: use coccinelle to convert mmap_sem rwsem call sites
  MM locking API: manual conversion of mmap_sem call sites missed by coccinelle
  MM locking API: add range arguments
  MM locking API: allow for sleeping during unlock
  MM locking API: implement fine grained range locks
  mm/memory: add range field to struct vm_fault
  mm/memory: allow specifying MM lock range to handle_mm_fault()
  do_swap_page: use the vmf->range field when dropping mmap_sem
  handle_userfault: use the vmf->range field when dropping mmap_sem
  x86 fault handler: merge bad_area() functions
  x86 fault handler: use an explicit MM lock range
  mm/memory: add prepare_mm_fault() function
  mm/swap_state: disable swap vma readahead
  x86 fault handler: use a pseudo-vma when operating on anonymous vmas.
  MM locking API: add vma locking API
  x86 fault handler: implement range locking
  shared file mappings: use the vmf->range field when dropping mmap_sem
  mm: add field to annotate vm_operations that support range locking
  x86 fault handler: extend range locking to supported file vmas
  do_mmap: add locked argument
  do_mmap: implement locked argument
  do_mmap: use locked=false in vm_mmap_pgoff() and aio_setup_ring()
  do_mmap: implement easiest cases of fine grained locking

 arch/alpha/kernel/traps.c | 4 +-
 arch/alpha/mm/fault.c | 10 +-
 arch/arc/kernel/process.c | 4 +-
 arch/arc/kernel/troubleshoot.c | 4 +-
 arch/arc/mm/fault.c | 4 +-
 arch/arm/kernel/process.c | 4 +-
 arch/arm/kernel/swp_emulate.c | 4 +-
 arch/arm/lib/uaccess_with_memcpy.c | 16 +-
 arch/arm/mm/fault.c | 6 +-
 arch/arm64/kernel/traps.c | 4 +-
 arch/arm64/kernel/vdso.c | 8 +-
 arch/arm64/mm/fault.c | 8 +-
 arch/csky/kernel/vdso.c | 4 +-
 arch/csky/mm/fault.c | 8 +-
 arch/hexagon/kernel/vdso.c | 4 +-
 arch/hexagon/mm/vm_fault.c | 8 +-
 arch/ia64/kernel/perfmon.c | 8 +-
 arch/ia64/mm/fault.c | 8 +-
 arch/ia64/mm/init.c | 12 +-
 arch/m68k/kernel/sys_m68k.c | 14 +-
 arch/m68k/mm/fault.c | 8 +-
 arch/microblaze/mm/fault.c | 12 +-
 arch/mips/kernel/traps.c | 4 +-
 arch/mips/kernel/vdso.c | 4 +-
 arch/mips/mm/fault.c | 10 +-
 arch/nds32/kernel/vdso.c | 6 +-
 arch/nds32/mm/fault.c | 12 +-
 arch/nios2/mm/fault.c | 12 +-
 arch/nios2/mm/init.c | 4 +-
 arch/openrisc/mm/fault.c | 10 +-
 arch/parisc/kernel/traps.c | 6 +-
 arch/parisc/mm/fault.c | 8 +-
 arch/powerpc/kernel/vdso.c | 6 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 4 +-
 arch/powerpc/kvm/book3s_hv.c | 6 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c | 12 +-
 arch/powerpc/kvm/e500_mmu_host.c | 4 +-
 arch/powerpc/mm/book3s64/iommu_api.c | 4 +-
 arch/powerpc/mm/book3s64/subpage_prot.c | 12 +-
 arch/powerpc/mm/copro_fault.c | 4 +-
 arch/powerpc/mm/fault.c | 12 +-
 arch/powerpc/oprofile/cell/spu_task_sync.c | 6 +-
 arch/powerpc/platforms/cell/spufs/file.c | 4 +-
 arch/riscv/kernel/vdso.c | 4 +-
 arch/riscv/mm/fault.c | 10 +-
 arch/s390/kernel/vdso.c | 4 +-
 arch/s390/kvm/gaccess.c | 4 +-
 arch/s390/kvm/kvm-s390.c | 24 +-
 arch/s390/kvm/priv.c | 32 +-
 arch/s390/mm/fault.c | 6 +-
 arch/s390/mm/gmap.c | 40 +-
 arch/s390/pci/pci_mmio.c | 4 +-
 arch/sh/kernel/sys_sh.c | 6 +-
 arch/sh/kernel/vsyscall/vsyscall.c | 4 +-
 arch/sh/mm/fault.c | 14 +-
 arch/sparc/mm/fault_32.c | 18 +-
 arch/sparc/mm/fault_64.c | 12 +-
 arch/sparc/vdso/vma.c | 4 +-
 arch/um/include/asm/mmu_context.h | 6 +-
 arch/um/kernel/tlb.c | 2 +-
 arch/um/kernel/trap.c | 6 +-
 arch/unicore32/mm/fault.c | 6 +-
 arch/x86/entry/vdso/vma.c | 10 +-
 arch/x86/kernel/tboot.c | 2 +-
 arch/x86/kernel/vm86_32.c | 4 +-
 arch/x86/kvm/mmu/paging_tmpl.h | 8 +-
 arch/x86/mm/debug_pagetables.c | 8 +-
 arch/x86/mm/fault.c | 110 ++-
 arch/x86/mm/mpx.c | 15 +-
 arch/x86/um/vdso/vma.c | 4 +-
 arch/xtensa/mm/fault.c | 10 +-
 drivers/android/binder_alloc.c | 10 +-
 drivers/firmware/efi/efi.c | 2 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 10 +-
 drivers/gpu/drm/amd/amdkfd/kfd_events.c | 4 +-
 drivers/gpu/drm/i915/gem/i915_gem_mman.c | 4 +-
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 8 +-
 drivers/gpu/drm/nouveau/nouveau_svm.c | 20 +-
 drivers/gpu/drm/radeon/radeon_cs.c | 4 +-
 drivers/gpu/drm/radeon/radeon_gem.c | 6 +-
 drivers/gpu/drm/ttm/ttm_bo_vm.c | 4 +-
 drivers/infiniband/core/umem.c | 6 +-
 drivers/infiniband/core/umem_odp.c | 10 +-
 drivers/infiniband/core/uverbs_main.c | 4 +-
 drivers/infiniband/hw/mlx4/mr.c | 4 +-
 drivers/infiniband/hw/qib/qib_user_pages.c | 6 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c | 4 +-
 drivers/infiniband/sw/siw/siw_mem.c | 4 +-
 drivers/iommu/amd_iommu_v2.c | 4 +-
 drivers/iommu/intel-svm.c | 4 +-
 drivers/media/v4l2-core/videobuf-core.c | 4 +-
 drivers/media/v4l2-core/videobuf-dma-contig.c | 4 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c | 4 +-
 drivers/misc/cxl/cxllib.c | 4 +-
 drivers/misc/cxl/fault.c | 4 +-
 drivers/misc/sgi-gru/grufault.c | 16 +-
 drivers/misc/sgi-gru/grufile.c | 4 +-
 drivers/oprofile/buffer_sync.c | 10 +-
 drivers/staging/kpc2000/kpc_dma/fileops.c | 4 +-
 drivers/tee/optee/call.c | 4 +-
 drivers/vfio/vfio_iommu_type1.c | 12 +-
 drivers/xen/gntdev.c | 4 +-
 drivers/xen/privcmd.c | 14 +-
 fs/aio.c | 16 +-
 fs/coredump.c | 4 +-
 fs/exec.c | 16 +-
 fs/ext4/file.c | 1 +
 fs/io_uring.c | 4 +-
 fs/proc/base.c | 18 +-
 fs/proc/task_mmu.c | 28 +-
 fs/proc/task_nommu.c | 18 +-
 fs/userfaultfd.c | 28 +-
 include/linux/hugetlb.h | 5 +-
 include/linux/mm.h | 56 +-
 include/linux/mm_lock.h | 285 ++++++++
 include/linux/mm_types.h | 22 +
 include/linux/mm_types_task.h | 21 +
 include/linux/mmu_notifier.h | 5 +-
 include/linux/pagemap.h | 7 +-
 include/linux/sched.h | 2 +
 init/init_task.c | 1 +
 ipc/shm.c | 11 +-
 kernel/acct.c | 4 +-
 kernel/bpf/stackmap.c | 32 +-
 kernel/events/core.c | 4 +-
 kernel/events/uprobes.c | 16 +-
 kernel/exit.c | 8 +-
 kernel/fork.c | 17 +-
 kernel/futex.c | 4 +-
 kernel/sched/fair.c | 4 +-
 kernel/sys.c | 18 +-
 kernel/trace/trace_output.c | 4 +-
 mm/Kconfig | 25 +
 mm/Makefile | 2 +
 mm/filemap.c | 10 +-
 mm/frame_vector.c | 4 +-
 mm/gup.c | 20 +-
 mm/hugetlb.c | 13 +-
 mm/init-mm.c | 3 +-
 mm/internal.h | 2 +-
 mm/khugepaged.c | 37 +-
 mm/ksm.c | 34 +-
 mm/madvise.c | 18 +-
 mm/memcontrol.c | 8 +-
 mm/memory.c | 55 +-
 mm/mempolicy.c | 22 +-
 mm/migrate.c | 8 +-
 mm/mincore.c | 4 +-
 mm/mlock.c | 16 +-
 mm/mm_lock_range.c | 691 ++++++++++++++++++
 mm/mm_lock_rwsem_checked.c | 134 ++++
 mm/mmap.c | 170 +++--
 mm/mmu_notifier.c | 4 +-
 mm/mprotect.c | 12 +-
 mm/mremap.c | 6 +-
 mm/msync.c | 8 +-
 mm/nommu.c | 36 +-
 mm/oom_kill.c | 4 +-
 mm/process_vm_access.c | 4 +-
 mm/shmem.c | 1 +
 mm/swap_state.c | 6 +
 mm/swapfile.c | 4 +-
 mm/userfaultfd.c | 14 +-
 mm/util.c | 14 +-
 net/ipv4/tcp.c | 4 +-
 net/xdp/xdp_umem.c | 4 +-
 virt/kvm/arm/mmu.c | 14 +-
 virt/kvm/async_pf.c | 4 +-
 virt/kvm/kvm_main.c | 8 +-
 170 files changed, 2183 insertions(+), 798 deletions(-)
 create mode 100644 include/linux/mm_lock.h
 create mode 100644 mm/mm_lock_range.c
 create mode 100644 mm/mm_lock_rwsem_checked.c