From patchwork Mon Dec 9 22:53:18 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Hubbard X-Patchwork-Id: 11281009 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id EE84A14B7 for ; Mon, 9 Dec 2019 22:58:51 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id B9A1C20721 for ; Mon, 9 Dec 2019 22:58:51 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=nvidia.com header.i=@nvidia.com header.b="LKkHOX01" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727111AbfLIWxy (ORCPT ); Mon, 9 Dec 2019 17:53:54 -0500 Received: from hqnvemgate25.nvidia.com ([216.228.121.64]:10131 "EHLO hqnvemgate25.nvidia.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726362AbfLIWxy (ORCPT ); Mon, 9 Dec 2019 17:53:54 -0500 Received: from hqpgpgate102.nvidia.com (Not Verified[216.228.121.13]) by hqnvemgate25.nvidia.com (using TLS: TLSv1.2, DES-CBC3-SHA) id ; Mon, 09 Dec 2019 14:53:45 -0800 Received: from hqmail.nvidia.com ([172.20.161.6]) by hqpgpgate102.nvidia.com (PGP Universal service); Mon, 09 Dec 2019 14:53:51 -0800 X-PGP-Universal: processed; by hqpgpgate102.nvidia.com on Mon, 09 Dec 2019 14:53:51 -0800 Received: from HQMAIL101.nvidia.com (172.20.187.10) by HQMAIL111.nvidia.com (172.20.187.18) with Microsoft SMTP Server (TLS) id 15.0.1473.3; Mon, 9 Dec 2019 22:53:50 +0000 Received: from rnnvemgw01.nvidia.com (10.128.109.123) by HQMAIL101.nvidia.com (172.20.187.10) with Microsoft SMTP Server (TLS) id 15.0.1473.3 via Frontend Transport; Mon, 9 Dec 2019 22:53:50 +0000 Received: from blueforge.nvidia.com (Not Verified[10.110.48.28]) by rnnvemgw01.nvidia.com with Trustwave SEG (v7,5,8,10121) id ; Mon, 09 Dec 2019 14:53:50 -0800 From: John Hubbard To: Andrew Morton CC: Al Viro , Alex Williamson , Benjamin Herrenschmidt , =?utf-8?b?QmrDtnJuIFQ=?= =?utf-8?b?w7ZwZWw=?= , Christoph Hellwig , Dan Williams , Daniel Vetter , Dave Chinner , David Airlie , "David S . Miller" , Ira Weiny , Jan Kara , Jason Gunthorpe , Jens Axboe , Jonathan Corbet , =?utf-8?b?SsOpcsO0bWUgR2xpc3Nl?= , Magnus Karlsson , Mauro Carvalho Chehab , Michael Ellerman , Michal Hocko , Mike Kravetz , Paul Mackerras , Shuah Khan , Vlastimil Babka , , , , , , , , , , , , , LKML , John Hubbard Subject: [PATCH v8 00/26] mm/gup: track dma-pinned pages: FOLL_PIN Date: Mon, 9 Dec 2019 14:53:18 -0800 Message-ID: <20191209225344.99740-1-jhubbard@nvidia.com> X-Mailer: git-send-email 2.24.0 MIME-Version: 1.0 X-NVConfidentiality: public DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=nvidia.com; s=n1; t=1575932025; bh=wxsTm/nUQAgyXOXiekXNWJzX7DswluWndSLLkdWyNsU=; h=X-PGP-Universal:From:To:CC:Subject:Date:Message-ID:X-Mailer: MIME-Version:X-NVConfidentiality:Content-Type: Content-Transfer-Encoding; b=LKkHOX01qLpN5I/n8Lng/OWQGwEhr76195x23RkvMppTWIJJbS3i9pwYiPoY5/bc5 e5lZ2j6HDv1RxnFlDmf+27mvJqvICyI/9swSG0NVHC7BwY7bXQWssErtwvybFU+CPO u+kFPOfd8BvFfqgkKvp99lbQIa4c8BLOMS5cIDbAEudwg6i2UIf/2UaPnt13+Iyjzn 0sbdp/2T7n7O7UyZ8N4n4miN8mogBm7OLnJzyzBlGrnY7d1eAMUlVAD7K5+Ep8IXKR QYsDxSG9vHnZW79dgiLsJtnrvvr42XAJPkYuzOsfQ7GjYxUczB56WLiKZW6qxNViGb CnldVfIf5duTw== Sender: linux-rdma-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-rdma@vger.kernel.org Hi, This implements an API naming change (put_user_page*() --> unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It extends that tracking to a few select subsystems. More subsystems will be added in follow up work. Christoph Hellwig, a couple of points of interest: a) I've moved the bulk of the code out of the inline functions, as requested, for the devmap changes (patch 4: "mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages"). b) Contrary to my earlier response to your review, I have not actually merged patch 23 ("mm/gup: pass flags arg to __gup_device_* functions") into patch 24 ("mm/gup: track FOLL_PIN pages"). This is because I suspect that it's better to avoid making patch 24 any larger and worse to review than it already is. But if you feel strongly about it, I'll combine them anyway. Changes since v7: * Rebased onto Linux 5.5-rc1 * Reworked the grab_page() and try_grab_compound_head(), for API consistency and less diffs (thanks to Jan Kara's reviews). * Added Leon Romanovsky's reviewed-by tags for two of the IB-related patches. * patch 4 refactoring changes, as mentioned above. There is a git repo and branch, for convenience: git@github.com:johnhubbard/linux.git pin_user_pages_tracking_v8 For the remaining list of "changes since version N", those are all in v7, which is here: https://lore.kernel.org/r/20191121071354.456618-1-jhubbard@nvidia.com ============================================================ Overview: This is a prerequisite to solving the problem of proper interactions between file-backed pages, and [R]DMA activities, as discussed in [1], [2], [3], and in a remarkable number of email threads since about 2017. :) A new internal gup flag, FOLL_PIN is introduced, and thoroughly documented in the last patch's Documentation/vm/pin_user_pages.rst. I believe that this will provide a good starting point for doing the layout lease work that Ira Weiny has been working on. That's because these new wrapper functions provide a clean, constrained, systematically named set of functionality that, again, is required in order to even know if a page is "dma-pinned". In contrast to earlier approaches, the page tracking can be incrementally applied to the kernel call sites that, until now, have been simply calling get_user_pages() ("gup"). In other words, opt-in by changing from this: get_user_pages() (sets FOLL_GET) put_page() to this: pin_user_pages() (sets FOLL_PIN) unpin_user_page() ============================================================ Testing: * I've done some overall kernel testing (LTP, and a few other goodies), and some directed testing to exercise some of the changes. And as you can see, gup_benchmark is enhanced to exercise this. Basically, I've been able to runtime test the core get_user_pages() and pin_user_pages() and related routines, but not so much on several of the call sites--but those are generally just a couple of lines changed, each. Not much of the kernel is actually using this, which on one hand reduces risk quite a lot. But on the other hand, testing coverage is low. So I'd love it if, in particular, the Infiniband and PowerPC folks could do a smoke test of this series for me. Runtime testing for the call sites so far is pretty light: * io_uring: Some directed tests from liburing exercise this, and they pass. * process_vm_access.c: A small directed test passes. * gup_benchmark: the enhanced version hits the new gup.c code, and passes. * infiniband: ran "ib_write_bw", which exercises the umem.c changes, but not the other changes. * VFIO: compiles (I'm vowing to set up a run time test soon, but it's not ready just yet) * powerpc: it compiles... * drm/via: compiles... * goldfish: compiles... * net/xdp: compiles... * media/v4l2: compiles... [1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/ [2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/ [3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/ Dan Williams (1): mm: Cleanup __put_devmap_managed_page() vs ->page_free() John Hubbard (25): mm/gup: factor out duplicate code from four routines mm/gup: move try_get_compound_head() to top, fix minor issues mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages goldish_pipe: rename local pin_user_pages() routine mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call mm/gup: allow FOLL_FORCE for get_user_pages_fast() IB/umem: use get_user_pages_fast() to pin DMA pages mm/gup: introduce pin_user_pages*() and FOLL_PIN goldish_pipe: convert to pin_user_pages() and put_user_page() IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote() drm/via: set FOLL_PIN via pin_user_pages_fast() fs/io_uring: set FOLL_PIN via pin_user_pages() net/xdp: set FOLL_PIN via pin_user_pages() media/v4l2-core: set pages dirty upon releasing DMA buffers media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page() conversion vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion powerpc: book3s64: convert to pin_user_pages() and put_user_page() mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1" mm, tree-wide: rename put_user_page*() to unpin_user_page*() mm/gup: pass flags arg to __gup_device_* functions mm/gup: track FOLL_PIN pages mm/gup_benchmark: support pin_user_pages() and related calls selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage Documentation/core-api/index.rst | 1 + Documentation/core-api/pin_user_pages.rst | 233 ++++++++ arch/powerpc/mm/book3s64/iommu_api.c | 12 +- drivers/gpu/drm/via/via_dmablit.c | 6 +- drivers/infiniband/core/umem.c | 19 +- drivers/infiniband/core/umem_odp.c | 13 +- drivers/infiniband/hw/hfi1/user_pages.c | 4 +- drivers/infiniband/hw/mthca/mthca_memfree.c | 8 +- drivers/infiniband/hw/qib/qib_user_pages.c | 4 +- drivers/infiniband/hw/qib/qib_user_sdma.c | 8 +- drivers/infiniband/hw/usnic/usnic_uiom.c | 4 +- drivers/infiniband/sw/siw/siw_mem.c | 4 +- drivers/media/v4l2-core/videobuf-dma-sg.c | 8 +- drivers/nvdimm/pmem.c | 6 - drivers/platform/goldfish/goldfish_pipe.c | 35 +- drivers/vfio/vfio_iommu_type1.c | 35 +- fs/io_uring.c | 6 +- include/linux/mm.h | 145 ++++- include/linux/mmzone.h | 2 + include/linux/page_ref.h | 10 + mm/gup.c | 595 +++++++++++++++----- mm/gup_benchmark.c | 74 ++- mm/huge_memory.c | 23 +- mm/hugetlb.c | 15 +- mm/memremap.c | 76 ++- mm/process_vm_access.c | 28 +- mm/swap.c | 24 + mm/vmstat.c | 2 + net/xdp/xdp_umem.c | 4 +- tools/testing/selftests/vm/gup_benchmark.c | 21 +- tools/testing/selftests/vm/run_vmtests | 22 + 31 files changed, 1093 insertions(+), 354 deletions(-) create mode 100644 Documentation/core-api/pin_user_pages.rst