[v11,00/25] mm/gup: track dma-pinned pages: FOLL_PIN

Message ID 20191216222537.491123-1-jhubbard@nvidia.com (mailing list archive)

Message

John Hubbard Dec. 16, 2019, 10:25 p.m. UTC
Hi,

This implements an API naming change (put_user_page*() -->
unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
extends that tracking to a few select subsystems. More subsystems will
be added in follow up work.

Christoph Hellwig, a point of interest:

a) I've moved the bulk of the code out of the inline functions, as
   requested, for the devmap changes (patch 4: "mm: devmap: refactor
   1-based refcounting for ZONE_DEVICE pages").

Changes since v10: Remaining fixes resulting from Jan Kara's reviews:

* Shifted to using the sign bit in page_dma_pinned() to allow accurate
  results even in the overflow case. See the comments in that routine
  for details. This allowed getting rid of the new
  page_ref_zero_or_close_to_bias_overflow(), in favor of a simple
  sign check via "page_ref_count() <= 0". (A minimal sketch of that
  check follows this list.)

* Simplified some of the huge_memory.c changes, and simplified a gup.c
  WARN invocation.

* Now using a standard -ENOMEM for most try_grab_page() failures.

* Got rid of tabs in the comment headers (I had thought they were
  required there, but it's actually the reverse: they are not
  allowed there).

* Rebased against 5.5-rc2 and retested.

* Added Jan Kara's reviewed-by tag for patch 23 (the main patch of the
  series).
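
As referenced in the first item above, the overflow-tolerant check looks
roughly like this (a sketch only, not the patch verbatim):

    static inline bool page_dma_pinned(struct page *page)
    {
            int count = page_ref_count(compound_head(page));

            /*
             * A refcount that has overflowed into the sign bit still
             * reads as "pinned": page_ref_count() <= 0 in that case.
             */
            if (count <= 0)
                    return true;

            return count >= GUP_PIN_COUNTING_BIAS;
    }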

Changes since v9: Fixes resulting from Jan Kara's and Jonathan Corbet's
reviews:

* Removed reviewed-by tags from the "mm/gup: track FOLL_PIN pages" patch
  (those were improperly inherited from the much smaller refactoring
  patch that was merged into it).

* Made try_grab_compound_head() and try_grab_page() behave consistently
  with respect to flags, in order to avoid "gotchas" later.

* follow_trans_huge_pmd(): moved the try_grab_page() to earlier in the
  routine, in order to avoid having to undo mlock_vma_page().

* follow_hugetlb_page(): removed a refcount overflow check that is now
  extraneous (and weaker than what try_grab_page() provides a few lines
  further down).

* Fixed up two Documentation flaws, pointed out by Jonathan Corbet's
  review.

Changes since v8:

* Merged the "mm/gup: pass flags arg to __gup_device_* functions" patch
  into the "mm/gup: track FOLL_PIN pages" patch, as requested by
  Christoph and Jan.

* Changed void grab_page() to bool try_grab_page(), and handled errors
  at the call sites. (From Jan's review comments.) try_grab_page()
  attempts to avoid page refcount overflows, even when counting up with
  GUP_PIN_COUNTING_BIAS increments. (A call-site sketch follows this
  list.)

* Fixed a bug that I'd introduced, when changing a BUG() to a WARN().

* Added Jan's reviewed-by tag to the "mm/gup: allow FOLL_FORCE for
  get_user_pages_fast()" patch.

* Documentation: pin_user_pages.rst: fixed an incorrect gup_benchmark
  invocation, left over from the pin_longterm days, spotted while preparing
  this version.

* Rebased onto today's linux.git (-rc1), and re-tested.
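
As noted above, call sites now check the bool result of try_grab_page().
A minimal sketch of the pattern (the surrounding context and unwind label
are hypothetical, for illustration only):

    if (unlikely(!try_grab_page(page, flags))) {
            page = ERR_PTR(-ENOMEM); /* the "standard -ENOMEM" noted above */
            goto out;                /* hypothetical unwind path */
    }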

Changes since v7:

* Rebased onto Linux 5.5-rc1

* Reworked grab_page() and try_grab_compound_head(), for API consistency
  and a smaller diff (thanks to Jan Kara's reviews).

* Added Leon Romanovsky's reviewed-by tags for two of the IB-related
  patches.

* patch 4 refactoring changes, as mentioned above.

There is a git repo and branch, for convenience:

    git@github.com:johnhubbard/linux.git pin_user_pages_tracking_v8

The remaining "changes since version N" lists are all in v7, which is
here:

  https://lore.kernel.org/r/20191121071354.456618-1-jhubbard@nvidia.com

============================================================
Overview:

This is a prerequisite to solving the problem of proper interactions
between file-backed pages and [R]DMA activities, as discussed in [1],
[2], [3], and in a remarkable number of email threads since about
2017. :)

A new internal gup flag, FOLL_PIN, is introduced, and thoroughly
documented in the new Documentation/core-api/pin_user_pages.rst.

I believe that this will provide a good starting point for the layout
lease work that Ira Weiny has been working on, because these new wrapper
functions provide a clean, constrained, systematically named set of
functions that is required in order to even know whether a page is
"dma-pinned".

In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:

    get_user_pages() (sets FOLL_GET)
    put_page()

to this:

    pin_user_pages() (sets FOLL_PIN)
    unpin_user_page()
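
A minimal sketch of such a call-site conversion (variable names and
surrounding context here are illustrative only, not taken from any
particular patch in the series):

    /* Before: acquires FOLL_GET references on each page: */
    ret = get_user_pages(start, nr_pages, FOLL_WRITE, pages, NULL);
    ...
    for (i = 0; i < ret; i++)
            put_page(pages[i]);

    /* After: acquires FOLL_PIN references, tracked as dma-pinned: */
    ret = pin_user_pages(start, nr_pages, FOLL_WRITE, pages, NULL);
    ...
    for (i = 0; i < ret; i++)
            unpin_user_page(pages[i]);

Call sites that dirty the pages can use unpin_user_pages_dirty_lock()
instead of an open-coded "set dirty, then unpin" loop.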

============================================================
Testing:

* I've done some overall kernel testing (LTP, and a few other goodies),
  and some directed testing to exercise some of the changes. And as you
  can see, gup_benchmark is enhanced to exercise this. Basically, I've
  been able to runtime test the core get_user_pages() and
  pin_user_pages() and related routines, but not so much several of the
  call sites--though those are generally just a couple of changed lines
  each.

  Not much of the kernel is actually using this, which on one hand
  reduces risk quite a lot. But on the other hand, testing coverage
  is low. So I'd love it if, in particular, the Infiniband and PowerPC
  folks could do a smoke test of this series for me.

  Runtime testing for the call sites so far is pretty light:

    * io_uring: Some directed tests from liburing exercise this, and
                they pass.
    * process_vm_access.c: A small directed test passes.
    * gup_benchmark: the enhanced version hits the new gup.c code, and
                     passes.
    * infiniband: ran "ib_write_bw", which exercises the umem.c changes,
                  but not the other changes.
    * VFIO: compiles (I'm vowing to set up a run time test soon, but it's
                      not ready just yet)
    * powerpc: it compiles...
    * drm/via: compiles...
    * goldfish: compiles...
    * net/xdp: compiles...
    * media/v4l2: compiles...

[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/

Dan Williams (1):
  mm: Cleanup __put_devmap_managed_page() vs ->page_free()

John Hubbard (24):
  mm/gup: factor out duplicate code from four routines
  mm/gup: move try_get_compound_head() to top, fix minor issues
  mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
  goldish_pipe: rename local pin_user_pages() routine
  mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM
  vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call
  mm/gup: allow FOLL_FORCE for get_user_pages_fast()
  IB/umem: use get_user_pages_fast() to pin DMA pages
  mm/gup: introduce pin_user_pages*() and FOLL_PIN
  goldish_pipe: convert to pin_user_pages() and put_user_page()
  IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP
  mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
  drm/via: set FOLL_PIN via pin_user_pages_fast()
  fs/io_uring: set FOLL_PIN via pin_user_pages()
  net/xdp: set FOLL_PIN via pin_user_pages()
  media/v4l2-core: set pages dirty upon releasing DMA buffers
  media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page()
    conversion
  vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion
  powerpc: book3s64: convert to pin_user_pages() and put_user_page()
  mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding
    "1"
  mm, tree-wide: rename put_user_page*() to unpin_user_page*()
  mm/gup: track FOLL_PIN pages
  mm/gup_benchmark: support pin_user_pages() and related calls
  selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN
    coverage

 Documentation/core-api/index.rst            |   1 +
 Documentation/core-api/pin_user_pages.rst   | 232 ++++++++
 arch/powerpc/mm/book3s64/iommu_api.c        |  10 +-
 drivers/gpu/drm/via/via_dmablit.c           |   6 +-
 drivers/infiniband/core/umem.c              |  19 +-
 drivers/infiniband/core/umem_odp.c          |  13 +-
 drivers/infiniband/hw/hfi1/user_pages.c     |   4 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |   8 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  |   4 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |   8 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c    |   4 +-
 drivers/infiniband/sw/siw/siw_mem.c         |   4 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c   |   8 +-
 drivers/nvdimm/pmem.c                       |   6 -
 drivers/platform/goldfish/goldfish_pipe.c   |  35 +-
 drivers/vfio/vfio_iommu_type1.c             |  35 +-
 fs/io_uring.c                               |   6 +-
 include/linux/mm.h                          | 155 ++++-
 include/linux/mmzone.h                      |   2 +
 include/linux/page_ref.h                    |  10 +
 mm/gup.c                                    | 626 +++++++++++++++-----
 mm/gup_benchmark.c                          |  74 ++-
 mm/huge_memory.c                            |  29 +-
 mm/hugetlb.c                                |  38 +-
 mm/memremap.c                               |  76 ++-
 mm/process_vm_access.c                      |  28 +-
 mm/swap.c                                   |  24 +
 mm/vmstat.c                                 |   2 +
 net/xdp/xdp_umem.c                          |   4 +-
 tools/testing/selftests/vm/gup_benchmark.c  |  21 +-
 tools/testing/selftests/vm/run_vmtests      |  22 +
 31 files changed, 1145 insertions(+), 369 deletions(-)
 create mode 100644 Documentation/core-api/pin_user_pages.rst

--
2.24.1

Comments

Jan Kara Dec. 17, 2019, 7:39 a.m. UTC | #1
Hi!

On Mon 16-12-19 14:25:12, John Hubbard wrote:
> Hi,
> 
> This implements an API naming change (put_user_page*() -->
> unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> extends that tracking to a few select subsystems. More subsystems will
> be added in follow up work.

Just a note for Andrew and others watching this series: At this point I'm fine
with the series so if someone still has some review feedback or wants to
check the series, now is the right time. Otherwise I think Andrew can push
the series to MM tree so that it will get wider testing exposure and is
prepared for the next merge window.

								Honza
Leon Romanovsky Dec. 19, 2019, 1:26 p.m. UTC | #2
On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> Hi,
>
> This implements an API naming change (put_user_page*() -->
> unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> extends that tracking to a few select subsystems. More subsystems will
> be added in follow up work.

Hi John,

The patchset generates kernel panics in our IB testing. In our tests, we
allocated single memory block and registered multiple MRs using the single
block.

The possible bad flow is:
 ib_umem_geti() ->
  pin_user_pages_fast(FOLL_WRITE) ->
   internal_get_user_pages_fast(FOLL_WRITE) ->
    gup_pgd_range() ->
     gup_huge_pd() ->
      gup_hugepte() ->
       try_grab_compound_head() ->

 108 static __maybe_unused struct page *try_grab_compound_head(struct page *page,
 109                                                           int refs,
 110                                                           unsigned int flags)
 111 {
 112         if (flags & FOLL_GET)
 113                 return try_get_compound_head(page, refs);
 114         else if (flags & FOLL_PIN)
 115                 return try_pin_compound_head(page, refs);
 116
 117         WARN_ON_ONCE(1);
 118         return NULL;
 119 }

# (master) $ dmesg
[10924.722220] mlx5_core 0000:00:08.0 eth2: Link up
[10924.725383] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[10960.902254] ------------[ cut here ]------------
[10960.905614] WARNING: CPU: 3 PID: 8838 at mm/gup.c:61 try_grab_compound_head+0x92/0xd0
[10960.907313] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp rpcrdma rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core kvm_intel mlx5_core rfkill mlxfw sunrpc virtio_net pci_hyperv_intf kvm irqbypass net_failover crc32_pclmul i2c_piix4 ptp crc32c_intel failover pcspkr ghash_clmulni_intel i2c_core pps_core sch_fq_codel ip_tables ata_generic pata_acpi serio_raw ata_piix floppy [last unloaded: mlxkvl]
[10960.917806] CPU: 3 PID: 8838 Comm: consume_mtts Tainted: G           OE     5.5.0-rc2-for-upstream-perf-2019-12-18_10-06-50-78 #1
[10960.920530] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[10960.923024] RIP: 0010:try_grab_compound_head+0x92/0xd0
[10960.924329] Code: e4 8d 14 06 48 8d 4f 34 f0 0f b1 57 34 0f 94 c2 84 d2 75 cb 85 c0 74 cd 8d 14 06 f0 0f b1 11 0f 94 c2 84 d2 75 b9 66 90 eb ea <0f> 0b 31 ff eb b7 85 c0 66 0f 1f 44 00 00 74 ab 8d 14 06 f0 0f b1
[10960.928512] RSP: 0018:ffffc9000129f880 EFLAGS: 00010082
[10960.929831] RAX: 0000000080000001 RBX: 00007f6397446000 RCX: 000fffffffe00000
[10960.931422] RDX: 0000000000040000 RSI: 0000000000011800 RDI: ffffea000f5d8000
[10960.933005] RBP: ffffc9000129f93c R08: ffffc9000129f93c R09: 0000000000200000
[10960.934584] R10: ffff88840774b200 R11: ffff888000000230 R12: 00007f6397446000
[10960.936212] R13: 0000000000000046 R14: 80000003d76000e7 R15: 0000000000000080
[10960.937793] FS:  00007f63a0590740(0000) GS:ffff88842f980000(0000) knlGS:0000000000000000
[10960.939962] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10960.941367] CR2: 00000000023e9008 CR3: 0000000406d0a002 CR4: 00000000007606e0
[10960.942975] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10960.944654] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[10960.946394] PKRU: 55555554
[10960.947310] Call Trace:
[10960.948193]  gup_pgd_range+0x61e/0x950
[10960.949585]  internal_get_user_pages_fast+0x98/0x1c0
[10960.951313]  ib_umem_get+0x2b3/0x5a0 [ib_uverbs]
[10960.952929]  mr_umem_get+0xd8/0x280 [mlx5_ib]
[10960.954150]  ? xas_store+0x49/0x550
[10960.955187]  mlx5_ib_reg_user_mr+0x149/0x7a0 [mlx5_ib]
[10960.956478]  ? xas_load+0x9/0x80
[10960.957474]  ? xa_load+0x54/0x90
[10960.958465]  ? lookup_get_idr_uobject.part.10+0x12/0x80 [ib_uverbs]
[10960.959926]  ib_uverbs_reg_mr+0x138/0x2a0 [ib_uverbs]
[10960.961192]  ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xb1/0xf0 [ib_uverbs]
[10960.963208]  ib_uverbs_cmd_verbs.isra.8+0x997/0xb30 [ib_uverbs]
[10960.964603]  ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
[10960.965949]  ? mem_cgroup_commit_charge+0x6a/0x140
[10960.967177]  ? page_add_new_anon_rmap+0x58/0xc0
[10960.968360]  ib_uverbs_ioctl+0xbc/0x130 [ib_uverbs]
[10960.969595]  do_vfs_ioctl+0xa6/0x640
[10960.970631]  ? syscall_trace_enter+0x1f8/0x2e0
[10960.971829]  ksys_ioctl+0x60/0x90
[10960.972825]  __x64_sys_ioctl+0x16/0x20
[10960.973888]  do_syscall_64+0x48/0x130
[10960.974949]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10960.976219] RIP: 0033:0x7f639fe9b267
[10960.977260] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48
[10960.981413] RSP: 002b:00007fff5335ca08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[10960.983472] RAX: ffffffffffffffda RBX: 00007fff5335ca98 RCX: 00007f639fe9b267
[10960.985037] RDX: 00007fff5335ca80 RSI: 00000000c0181b01 RDI: 0000000000000003
[10960.986603] RBP: 00007fff5335ca60 R08: 0000000000000003 R09: 00007f63a055e010
[10960.988194] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f63a055e150
[10960.989903] R13: 00007fff5335ca60 R14: 00007fff5335cc38 R15: 00007f6397246000
[10960.991544] ---[ end trace 1f0ee07a75a16a93 ]---
[10960.992773] ------------[ cut here ]------------
[10960.993995] WARNING: CPU: 3 PID: 8838 at mm/gup.c:150 try_grab_page+0x55/0x70
[10960.995758] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp rpcrdma rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core kvm_intel mlx5_core rfkill mlxfw sunrpc virtio_net pci_hyperv_intf kvm irqbypass net_failover crc32_pclmul i2c_piix4 ptp crc32c_intel failover pcspkr ghash_clmulni_intel i2c_core pps_core sch_fq_codel ip_tables ata_generic pata_acpi serio_raw ata_piix floppy [last unloaded: mlxkvl]
[10961.008579] CPU: 3 PID: 8838 Comm: consume_mtts Tainted: G        W  OE     5.5.0-rc2-for-upstream-perf-2019-12-18_10-06-50-78 #1
[10961.011416] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[10961.013766] RIP: 0010:try_grab_page+0x55/0x70
[10961.014921] Code: 00 04 00 00 b8 01 00 00 00 f3 c3 48 8b 47 08 a8 01 75 1c 8b 47 34 85 c0 7e 1d f0 ff 47 34 b8 01 00 00 00 c3 48 8d 78 ff eb cb <0f> 0b 31 c0 c3 48 8d 78 ff 66 90 eb dc 0f 0b 31 c0 c3 66 0f 1f 84
[10961.019058] RSP: 0018:ffffc9000129f7e8 EFLAGS: 00010282
[10961.020351] RAX: 0000000080000001 RBX: 0000000000050201 RCX: 000000000f5d8000
[10961.021921] RDX: 000ffffffffff000 RSI: 0000000000040000 RDI: ffffea000f5d8000
[10961.023494] RBP: 00007f6397400000 R08: ffffea000f986cc0 R09: ffff8883c758bdd0
[10961.025067] R10: 0000000000000001 R11: ffff888000000230 R12: ffff888407701c00
[10961.026637] R13: ffff8883e61b35d0 R14: ffffea000f5d8000 R15: 0000000000050201
[10961.028217] FS:  00007f63a0590740(0000) GS:ffff88842f980000(0000) knlGS:0000000000000000
[10961.030353] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10961.031721] CR2: 00000000023e9008 CR3: 0000000406d0a002 CR4: 00000000007606e0
[10961.033305] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10961.034884] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[10961.036456] PKRU: 55555554
[10961.037369] Call Trace:
[10961.038285]  follow_trans_huge_pmd+0x10c/0x300
[10961.039555]  follow_page_mask+0x64a/0x760
[10961.040762]  __get_user_pages+0xf1/0x720
[10961.041851]  ? apic_timer_interrupt+0xa/0x20
[10961.042996]  internal_get_user_pages_fast+0x14b/0x1c0
[10961.044266]  ib_umem_get+0x2b3/0x5a0 [ib_uverbs]
[10961.045474]  mr_umem_get+0xd8/0x280 [mlx5_ib]
[10961.046652]  ? xas_store+0x49/0x550
[10961.047696]  mlx5_ib_reg_user_mr+0x149/0x7a0 [mlx5_ib]
[10961.048967]  ? xas_load+0x9/0x80
[10961.049949]  ? xa_load+0x54/0x90
[10961.050935]  ? lookup_get_idr_uobject.part.10+0x12/0x80 [ib_uverbs]
[10961.052378]  ib_uverbs_reg_mr+0x138/0x2a0 [ib_uverbs]
[10961.053635]  ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xb1/0xf0 [ib_uverbs]
[10961.055646]  ib_uverbs_cmd_verbs.isra.8+0x997/0xb30 [ib_uverbs]
[10961.057033]  ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
[10961.058381]  ? mem_cgroup_commit_charge+0x6a/0x140
[10961.059611]  ? page_add_new_anon_rmap+0x58/0xc0
[10961.060796]  ib_uverbs_ioctl+0xbc/0x130 [ib_uverbs]
[10961.062034]  do_vfs_ioctl+0xa6/0x640
[10961.063081]  ? syscall_trace_enter+0x1f8/0x2e0
[10961.064253]  ksys_ioctl+0x60/0x90
[10961.065252]  __x64_sys_ioctl+0x16/0x20
[10961.066315]  do_syscall_64+0x48/0x130
[10961.067382]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10961.068647] RIP: 0033:0x7f639fe9b267
[10961.069691] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48
[10961.073882] RSP: 002b:00007fff5335ca08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[10961.075949] RAX: ffffffffffffffda RBX: 00007fff5335ca98 RCX: 00007f639fe9b267
[10961.077545] RDX: 00007fff5335ca80 RSI: 00000000c0181b01 RDI: 0000000000000003
[10961.079128] RBP: 00007fff5335ca60 R08: 0000000000000003 R09: 00007f63a055e010
[10961.080709] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f63a055e150
[10961.082278] R13: 00007fff5335ca60 R14: 00007fff5335cc38 R15: 00007f6397246000
[10961.083873] ---[ end trace 1f0ee07a75a16a94 ]---

Thanks
John Hubbard Dec. 19, 2019, 8:30 p.m. UTC | #3
On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
>> Hi,
>>
>> This implements an API naming change (put_user_page*() -->
>> unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
>> extends that tracking to a few select subsystems. More subsystems will
>> be added in follow up work.
> 
> Hi John,
> 
> The patchset generates kernel panics in our IB testing. In our tests, we
> allocated single memory block and registered multiple MRs using the single
> block.
> 
> The possible bad flow is:
>   ib_umem_geti() ->
>    pin_user_pages_fast(FOLL_WRITE) ->
>     internal_get_user_pages_fast(FOLL_WRITE) ->
>      gup_pgd_range() ->
>       gup_huge_pd() ->
>        gup_hugepte() ->
>         try_grab_compound_head() ->

Hi Leon,

Thanks very much for the detailed report! So we're overflowing...

At first look, this seems likely to be hitting a weak point in the
GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
(there's a writeup in Documentation/core-api/pin_user_page.rst, lines
99-121). Basically it's pretty easy to overflow the page->_refcount
with huge pages if the pages have a *lot* of subpages.

We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
Do you have any idea how many pins (repeated pins on the same page, which
it sounds like you have) might be involved in your test case,
and the huge page and system page sizes? That would allow calculating
if we're likely overflowing for that reason.
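
To spell out that "about 7 pins" estimate (assuming a signed 32-bit
page->_refcount, 4KB base pages, and GUP_PIN_COUNTING_BIAS == 1024):

    subpages per 1GB huge page:        1GB / 4KB     = 262144 (2^18)
    refcount added per full-page pin:  262144 * 1024 = 2^28
    signed 32-bit refcount headroom:   2^31 - 1
    pins before the sign bit is hit:   ~2^31 / 2^28  = 8, so about 7 safe
                                       pins per 1GB huge page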

So, ideas and next steps:

1. Assuming that you *are* hitting this, I think I may have to fall back to
implementing the "deferred" part of this design, as part of this series, after
all. That means:

   For the pin/unpin calls at least, stop treating all pages as if they are
   a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one page.
   That's not how it works now, and the need to hand back a huge array of
   subpages is part of the problem. This affects the callers too, so it's not
   a super quick change to make. (I was really hoping not to have to do this
   yet.)

2. OR, maybe if you're *close* to the overflow, I could buy some development
time by moving the boundary between pinned vs get_page() refcounts, by
reducing GUP_PIN_COUNTING_BIAS. That's less robust, but I don't want
to rule it out just yet. After all, 1024 is a big chunk to count up with,
and even if get_page() calls add up to, say, 512 refs on a page, it's still
just a false positive on page_dma_pinned(). And false positives, if transient,
are OK.

3. It would be nice if I could reproduce this. I have a two-node mlx5 Infiniband
test setup, but I have done only the tiniest bit of user space IB coding, so
if you have any test programs that aren't too hard to deal with that could
possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind
that I'm not an advanced IB programmer. At all. :)

4. (minor note to self) This also uncovers a minor weakness in diagnostics:
there's no page dump in these reports, because I chickened out and didn't
include my WARN_ONCE_PAGE() macro that would have provided it. Although,
even without it, it's obvious that this is a page overflow.


thanks,
Jason Gunthorpe Dec. 19, 2019, 9:07 p.m. UTC | #4
On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote:
> On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > Hi,
> > > 
> > > This implements an API naming change (put_user_page*() -->
> > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> > > extends that tracking to a few select subsystems. More subsystems will
> > > be added in follow up work.
> > 
> > Hi John,
> > 
> > The patchset generates kernel panics in our IB testing. In our tests, we
> > allocated single memory block and registered multiple MRs using the single
> > block.
> > 
> > The possible bad flow is:
> >   ib_umem_geti() ->
> >    pin_user_pages_fast(FOLL_WRITE) ->
> >     internal_get_user_pages_fast(FOLL_WRITE) ->
> >      gup_pgd_range() ->
> >       gup_huge_pd() ->
> >        gup_hugepte() ->
> >         try_grab_compound_head() ->
> 
> Hi Leon,
> 
> Thanks very much for the detailed report! So we're overflowing...
> 
> At first look, this seems likely to be hitting a weak point in the
> GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
> (there's a writeup in Documentation/core-api/pin_user_page.rst, lines
> 99-121). Basically it's pretty easy to overflow the page->_refcount
> with huge pages if the pages have a *lot* of subpages.
> 
> We can only do about 7 pins on 1GB huge pages that use 4KB subpages.

Considering that establishing these pins is entirely under user
control, we can't have a limit here.

If the number of allowed pins are exhausted then the
pin_user_pages_fast() must fail back to the user.

> 3. It would be nice if I could reproduce this. I have a two-node mlx5 Infiniband
> test setup, but I have done only the tiniest bit of user space IB coding, so
> if you have any test programs that aren't too hard to deal with that could
> possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind
> that I'm not an advanced IB programmer. At all. :)

Clone this:

https://github.com/linux-rdma/rdma-core.git

Install all the required deps to build it (notably cython), see the README.md

$ ./build.sh
$ build/bin/run_tests.py 

If you get things that far I think Leon can get a reproduction for you

Jason
John Hubbard Dec. 19, 2019, 9:13 p.m. UTC | #5
On 12/19/19 1:07 PM, Jason Gunthorpe wrote:
> On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote:
>> On 12/19/19 5:26 AM, Leon Romanovsky wrote:
>>> On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
>>>> Hi,
>>>>
>>>> This implements an API naming change (put_user_page*() -->
>>>> unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
>>>> extends that tracking to a few select subsystems. More subsystems will
>>>> be added in follow up work.
>>>
>>> Hi John,
>>>
>>> The patchset generates kernel panics in our IB testing. In our tests, we
>>> allocated single memory block and registered multiple MRs using the single
>>> block.
>>>
>>> The possible bad flow is:
>>>    ib_umem_geti() ->
>>>     pin_user_pages_fast(FOLL_WRITE) ->
>>>      internal_get_user_pages_fast(FOLL_WRITE) ->
>>>       gup_pgd_range() ->
>>>        gup_huge_pd() ->
>>>         gup_hugepte() ->
>>>          try_grab_compound_head() ->
>>
>> Hi Leon,
>>
>> Thanks very much for the detailed report! So we're overflowing...
>>
>> At first look, this seems likely to be hitting a weak point in the
>> GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
>> (there's a writeup in Documentation/core-api/pin_user_page.rst, lines
>> 99-121). Basically it's pretty easy to overflow the page->_refcount
>> with huge pages if the pages have a *lot* of subpages.
>>
>> We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
> 
> Considering that establishing these pins is entirely under user
> control, we can't have a limit here.

There's already a limit, it's just a much larger one. :) What does "no limit"
really mean, numerically, to you in this case?

> 
> If the number of allowed pins are exhausted then the
> pin_user_pages_fast() must fail back to the user.


I'll poke around the IB call stack and see how much of that return path
is in place, if any. Because it's the same situation for get_user_pages_fast().
This code just added a warning on overflow so we could spot it early.

> 
>> 3. It would be nice if I could reproduce this. I have a two-node mlx5 Infiniband
>> test setup, but I have done only the tiniest bit of user space IB coding, so
>> if you have any test programs that aren't too hard to deal with that could
>> possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind
>> that I'm not an advanced IB programmer. At all. :)
> 
> Clone this:
> 
> https://github.com/linux-rdma/rdma-core.git
> 
> Install all the required deps to build it (notably cython), see the README.md
> 
> $ ./build.sh
> $ build/bin/run_tests.py
> 
> If you get things that far I think Leon can get a reproduction for you
> 

OK, here goes.

thanks,
John Hubbard Dec. 19, 2019, 10:58 p.m. UTC | #6
On 12/19/19 1:07 PM, Jason Gunthorpe wrote:
...
>> 3. It would be nice if I could reproduce this. I have a two-node mlx5 Infiniband
>> test setup, but I have done only the tiniest bit of user space IB coding, so
>> if you have any test programs that aren't too hard to deal with that could
>> possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind
>> that I'm not an advanced IB programmer. At all. :)
> 
> Clone this:
> 
> https://github.com/linux-rdma/rdma-core.git
> 
> Install all the required deps to build it (notably cython), see the README.md
> 
> $ ./build.sh
> $ build/bin/run_tests.py
> 
> If you get things that far I think Leon can get a reproduction for you
> 

Cool, it's up and running (1 failure, 3 skipped, out of 67 tests).

This is a great test suite to have running, I'll add it to my scripts. Here's the
full output in case the failure or skip cases are a problem:

$ sudo ./build/bin/run_tests.py --verbose

test_create_ah (tests.test_addr.AHTest) ... ok
test_create_ah_roce (tests.test_addr.AHTest) ... skipped "Can't run RoCE tests on IB link layer"
test_destroy_ah (tests.test_addr.AHTest) ... ok
test_create_comp_channel (tests.test_cq.CCTest) ... ok
test_destroy_comp_channel (tests.test_cq.CCTest) ... ok
test_create_cq_ex (tests.test_cq.CQEXTest) ... ok
test_create_cq_ex_bad_flow (tests.test_cq.CQEXTest) ... ok
test_destroy_cq_ex (tests.test_cq.CQEXTest) ... ok
test_create_cq (tests.test_cq.CQTest) ... ok
test_create_cq_bad_flow (tests.test_cq.CQTest) ... ok
test_destroy_cq (tests.test_cq.CQTest) ... ok
test_rc_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok
test_ud_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok
test_xrc_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok
test_create_dm (tests.test_device.DMTest) ... ok
test_create_dm_bad_flow (tests.test_device.DMTest) ... ok
test_destroy_dm (tests.test_device.DMTest) ... ok
test_destroy_dm_bad_flow (tests.test_device.DMTest) ... ok
test_dm_read (tests.test_device.DMTest) ... ok
test_dm_write (tests.test_device.DMTest) ... ok
test_dm_write_bad_flow (tests.test_device.DMTest) ... ok
test_dev_list (tests.test_device.DeviceTest) ... ok
test_open_dev (tests.test_device.DeviceTest) ... ok
test_query_device (tests.test_device.DeviceTest) ... ok
test_query_device_ex (tests.test_device.DeviceTest) ... ok
test_query_gid (tests.test_device.DeviceTest) ... ok
test_query_port (tests.test_device.DeviceTest) ... FAIL
test_query_port_bad_flow (tests.test_device.DeviceTest) ... ok
test_create_dm_mr (tests.test_mr.DMMRTest) ... ok
test_destroy_dm_mr (tests.test_mr.DMMRTest) ... ok
test_buffer (tests.test_mr.MRTest) ... ok
test_dereg_mr (tests.test_mr.MRTest) ... ok
test_dereg_mr_twice (tests.test_mr.MRTest) ... ok
test_lkey (tests.test_mr.MRTest) ... ok
test_read (tests.test_mr.MRTest) ... ok
test_reg_mr (tests.test_mr.MRTest) ... ok
test_reg_mr_bad_flags (tests.test_mr.MRTest) ... ok
test_reg_mr_bad_flow (tests.test_mr.MRTest) ... ok
test_rkey (tests.test_mr.MRTest) ... ok
test_write (tests.test_mr.MRTest) ... ok
test_dereg_mw_type1 (tests.test_mr.MWTest) ... ok
test_dereg_mw_type2 (tests.test_mr.MWTest) ... ok
test_reg_mw_type1 (tests.test_mr.MWTest) ... ok
test_reg_mw_type2 (tests.test_mr.MWTest) ... ok
test_reg_mw_wrong_type (tests.test_mr.MWTest) ... ok
test_odp_rc_traffic (tests.test_odp.OdpTestCase) ... ok
test_odp_ud_traffic (tests.test_odp.OdpTestCase) ... skipped 'ODP is not supported - ODP recv not supported'
test_odp_xrc_traffic (tests.test_odp.OdpTestCase) ... ok
test_default_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok
test_mem_align_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok
test_without_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok
test_alloc_pd (tests.test_pd.PDTest) ... ok
test_create_pd_none_ctx (tests.test_pd.PDTest) ... ok
test_dealloc_pd (tests.test_pd.PDTest) ... ok
test_destroy_pd_twice (tests.test_pd.PDTest) ... ok
test_multiple_pd_creation (tests.test_pd.PDTest) ... ok
test_create_qp_ex_no_attr (tests.test_qp.QPTest) ... ok
test_create_qp_ex_no_attr_connected (tests.test_qp.QPTest) ... ok
test_create_qp_ex_with_attr (tests.test_qp.QPTest) ... ok
test_create_qp_ex_with_attr_connected (tests.test_qp.QPTest) ... ok
test_create_qp_no_attr (tests.test_qp.QPTest) ... ok
test_create_qp_no_attr_connected (tests.test_qp.QPTest) ... ok
test_create_qp_with_attr (tests.test_qp.QPTest) ... ok
test_create_qp_with_attr_connected (tests.test_qp.QPTest) ... ok
test_modify_qp (tests.test_qp.QPTest) ... ok
test_query_qp (tests.test_qp.QPTest) ... ok
test_rdmacm_sync_traffic (tests.test_rdmacm.CMTestCase) ... skipped 'No devices with net interface'

======================================================================
FAIL: test_query_port (tests.test_device.DeviceTest)
----------------------------------------------------------------------
Traceback (most recent call last):
   File "/kernel_work/rdma-core/tests/test_device.py", line 129, in test_query_port
     self.verify_port_attr(port_attr)
   File "/kernel_work/rdma-core/tests/test_device.py", line 113, in verify_port_attr
     assert 'Invalid' not in d.speed_to_str(attr.active_speed)
AssertionError

----------------------------------------------------------------------
Ran 67 tests in 10.058s

FAILED (failures=1, skipped=3)


thanks,
Jan Kara Dec. 20, 2019, 9:21 a.m. UTC | #7
On Thu 19-12-19 12:30:31, John Hubbard wrote:
> On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > Hi,
> > > 
> > > This implements an API naming change (put_user_page*() -->
> > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> > > extends that tracking to a few select subsystems. More subsystems will
> > > be added in follow up work.
> > 
> > Hi John,
> > 
> > The patchset generates kernel panics in our IB testing. In our tests, we
> > allocated single memory block and registered multiple MRs using the single
> > block.
> > 
> > The possible bad flow is:
> >   ib_umem_geti() ->
> >    pin_user_pages_fast(FOLL_WRITE) ->
> >     internal_get_user_pages_fast(FOLL_WRITE) ->
> >      gup_pgd_range() ->
> >       gup_huge_pd() ->
> >        gup_hugepte() ->
> >         try_grab_compound_head() ->
> 
> Hi Leon,
> 
> Thanks very much for the detailed report! So we're overflowing...
> 
> At first look, this seems likely to be hitting a weak point in the
> GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
> (there's a writeup in Documentation/core-api/pin_user_page.rst, lines
> 99-121). Basically it's pretty easy to overflow the page->_refcount
> with huge pages if the pages have a *lot* of subpages.
> 
> We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
> Do you have any idea how many pins (repeated pins on the same page, which
> it sounds like you have) might be involved in your test case,
> and the huge page and system page sizes? That would allow calculating
> if we're likely overflowing for that reason.
> 
> So, ideas and next steps:
> 
> 1. Assuming that you *are* hitting this, I think I may have to fall back to
> implementing the "deferred" part of this design, as part of this series, after
> all. That means:
> 
>   For the pin/unpin calls at least, stop treating all pages as if they are
>   a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one page.
>   That's not how it works now, and the need to hand back a huge array of
>   subpages is part of the problem. This affects the callers too, so it's not
>   a super quick change to make. (I was really hoping not to have to do this
>   yet.)

Does that mean that you would need to make all GUP users huge page aware?
Otherwise I don't see how what you suggest would work... And I don't think
making all GUP users huge page aware is realistic (effort-wise) or even
wanted (maintenance overhead in all those places).

I believe there might be also a different solution for this: For
transparent huge pages, we could find a space in 'struct page' of the
second page in the huge page for proper pin counter and just account pins
there so we'd have full width of 32-bits for it.
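
A rough sketch of that idea (the field and helper names below are
hypothetical, purely to make the layout concrete):

    /*
     * Hypothetical: reuse space in the second tail page of a THP for a
     * dedicated pin counter, so FOLL_PIN gets a full 32-bit counter of
     * its own instead of sharing _refcount via a large bias.
     */
    static inline atomic_t *hpage_pincount_ptr(struct page *head)
    {
            return &head[1].hpage_pinned_refcount;  /* hypothetical field */
    }

    static inline void hpage_pin(struct page *head, int refs)
    {
            atomic_add(refs, hpage_pincount_ptr(head));
    }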

								Honza
Jason Gunthorpe Dec. 20, 2019, 1:34 p.m. UTC | #8
On Thu, Dec 19, 2019 at 01:13:54PM -0800, John Hubbard wrote:
> On 12/19/19 1:07 PM, Jason Gunthorpe wrote:
> > On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote:
> > > On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > > > Hi,
> > > > > 
> > > > > This implements an API naming change (put_user_page*() -->
> > > > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> > > > > extends that tracking to a few select subsystems. More subsystems will
> > > > > be added in follow up work.
> > > > 
> > > > Hi John,
> > > > 
> > > > The patchset generates kernel panics in our IB testing. In our tests, we
> > > > allocated single memory block and registered multiple MRs using the single
> > > > block.
> > > > 
> > > > The possible bad flow is:
> > > >    ib_umem_geti() ->
> > > >     pin_user_pages_fast(FOLL_WRITE) ->
> > > >      internal_get_user_pages_fast(FOLL_WRITE) ->
> > > >       gup_pgd_range() ->
> > > >        gup_huge_pd() ->
> > > >         gup_hugepte() ->
> > > >          try_grab_compound_head() ->
> > > 
> > > Hi Leon,
> > > 
> > > Thanks very much for the detailed report! So we're overflowing...
> > > 
> > > At first look, this seems likely to be hitting a weak point in the
> > > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
> > > (there's a writeup in Documentation/core-api/pin_user_page.rst, lines
> > > 99-121). Basically it's pretty easy to overflow the page->_refcount
> > > with huge pages if the pages have a *lot* of subpages.
> > > 
> > > We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
> > 
> > Considering that establishing these pins is entirely under user
> > control, we can't have a limit here.
> 
> There's already a limit, it's just a much larger one. :) What does "no limit"
> really mean, numerically, to you in this case?

I guess I mean 'hidden limit' - hitting the limit and failing would
be manageable.

I think 7 is probably too low though, but we are not using 1GB huge
pages, only 2M.

> > If the number of allowed pins are exhausted then the
> > pin_user_pages_fast() must fail back to the user.
> 
> I'll poke around the IB call stack and see how much of that return
> path is in place, if any. Because it's the same situation for
> get_user_pages_fast().  This code just added a warning on overflow
> so we could spot it early.

All GUP callers must be prepared for failure, IB should be fine...

Jason
Leon Romanovsky Dec. 20, 2019, 6:29 p.m. UTC | #9
On Thu, Dec 19, 2019 at 05:07:43PM -0400, Jason Gunthorpe wrote:
> On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote:
> > On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > > Hi,
> > > >
> > > > This implements an API naming change (put_user_page*() -->
> > > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> > > > extends that tracking to a few select subsystems. More subsystems will
> > > > be added in follow up work.
> > >
> > > Hi John,
> > >
> > > The patchset generates kernel panics in our IB testing. In our tests, we
> > > allocated single memory block and registered multiple MRs using the single
> > > block.
> > >
> > > The possible bad flow is:
> > >   ib_umem_geti() ->
> > >    pin_user_pages_fast(FOLL_WRITE) ->
> > >     internal_get_user_pages_fast(FOLL_WRITE) ->
> > >      gup_pgd_range() ->
> > >       gup_huge_pd() ->
> > >        gup_hugepte() ->
> > >         try_grab_compound_head() ->
> >
> > Hi Leon,
> >
> > Thanks very much for the detailed report! So we're overflowing...
> >
> > At first look, this seems likely to be hitting a weak point in the
> > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
> > (there's a writeup in Documentation/core-api/pin_user_page.rst, lines
> > 99-121). Basically it's pretty easy to overflow the page->_refcount
> > with huge pages if the pages have a *lot* of subpages.
> >
> > We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
>
> Considering that establishing these pins is entirely under user
> control, we can't have a limit here.
>
> If the number of allowed pins are exhausted then the
> pin_user_pages_fast() must fail back to the user.
>
> > 3. It would be nice if I could reproduce this. I have a two-node mlx5 Infiniband
> > test setup, but I have done only the tiniest bit of user space IB coding, so
> > if you have any test programs that aren't too hard to deal with that could
> > possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind
> > that I'm not an advanced IB programmer. At all. :)
>
> Clone this:
>
> https://github.com/linux-rdma/rdma-core.git
>
> Install all the required deps to build it (notably cython), see the README.md
>
> $ ./build.sh
> $ build/bin/run_tests.py
>
> If you get things that far I think Leon can get a reproduction for you

I'm not so optimistic about that.

Thanks

>
> Jason
Leon Romanovsky Dec. 20, 2019, 6:48 p.m. UTC | #10
On Thu, Dec 19, 2019 at 02:58:43PM -0800, John Hubbard wrote:
> On 12/19/19 1:07 PM, Jason Gunthorpe wrote:
> ...
> > > 3. It would be nice if I could reproduce this. I have a two-node mlx5 Infiniband
> > > test setup, but I have done only the tiniest bit of user space IB coding, so
> > > if you have any test programs that aren't too hard to deal with that could
> > > possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind
> > > that I'm not an advanced IB programmer. At all. :)
> >
> > Clone this:
> >
> > https://github.com/linux-rdma/rdma-core.git
> >
> > Install all the required deps to build it (notably cython), see the README.md
> >
> > $ ./build.sh
> > $ build/bin/run_tests.py
> >
> > If you get things that far I think Leon can get a reproduction for you
> >
>
> Cool, it's up and running (1 failure, 3 skipped, out of 67 tests).
>
> This is a great test suite to have running, I'll add it to my scripts. Here's the
> full output in case the failure or skip cases are a problem:
>
> $ sudo ./build/bin/run_tests.py --verbose
>
> test_create_ah (tests.test_addr.AHTest) ... ok
> test_create_ah_roce (tests.test_addr.AHTest) ... skipped "Can't run RoCE tests on IB link layer"
> test_destroy_ah (tests.test_addr.AHTest) ... ok
> test_create_comp_channel (tests.test_cq.CCTest) ... ok
> test_destroy_comp_channel (tests.test_cq.CCTest) ... ok
> test_create_cq_ex (tests.test_cq.CQEXTest) ... ok
> test_create_cq_ex_bad_flow (tests.test_cq.CQEXTest) ... ok
> test_destroy_cq_ex (tests.test_cq.CQEXTest) ... ok
> test_create_cq (tests.test_cq.CQTest) ... ok
> test_create_cq_bad_flow (tests.test_cq.CQTest) ... ok
> test_destroy_cq (tests.test_cq.CQTest) ... ok
> test_rc_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok
> test_ud_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok
> test_xrc_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok
> test_create_dm (tests.test_device.DMTest) ... ok
> test_create_dm_bad_flow (tests.test_device.DMTest) ... ok
> test_destroy_dm (tests.test_device.DMTest) ... ok
> test_destroy_dm_bad_flow (tests.test_device.DMTest) ... ok
> test_dm_read (tests.test_device.DMTest) ... ok
> test_dm_write (tests.test_device.DMTest) ... ok
> test_dm_write_bad_flow (tests.test_device.DMTest) ... ok
> test_dev_list (tests.test_device.DeviceTest) ... ok
> test_open_dev (tests.test_device.DeviceTest) ... ok
> test_query_device (tests.test_device.DeviceTest) ... ok
> test_query_device_ex (tests.test_device.DeviceTest) ... ok
> test_query_gid (tests.test_device.DeviceTest) ... ok
> test_query_port (tests.test_device.DeviceTest) ... FAIL
> test_query_port_bad_flow (tests.test_device.DeviceTest) ... ok
> test_create_dm_mr (tests.test_mr.DMMRTest) ... ok
> test_destroy_dm_mr (tests.test_mr.DMMRTest) ... ok
> test_buffer (tests.test_mr.MRTest) ... ok
> test_dereg_mr (tests.test_mr.MRTest) ... ok
> test_dereg_mr_twice (tests.test_mr.MRTest) ... ok
> test_lkey (tests.test_mr.MRTest) ... ok
> test_read (tests.test_mr.MRTest) ... ok
> test_reg_mr (tests.test_mr.MRTest) ... ok
> test_reg_mr_bad_flags (tests.test_mr.MRTest) ... ok
> test_reg_mr_bad_flow (tests.test_mr.MRTest) ... ok
> test_rkey (tests.test_mr.MRTest) ... ok
> test_write (tests.test_mr.MRTest) ... ok
> test_dereg_mw_type1 (tests.test_mr.MWTest) ... ok
> test_dereg_mw_type2 (tests.test_mr.MWTest) ... ok
> test_reg_mw_type1 (tests.test_mr.MWTest) ... ok
> test_reg_mw_type2 (tests.test_mr.MWTest) ... ok
> test_reg_mw_wrong_type (tests.test_mr.MWTest) ... ok
> test_odp_rc_traffic (tests.test_odp.OdpTestCase) ... ok
> test_odp_ud_traffic (tests.test_odp.OdpTestCase) ... skipped 'ODP is not supported - ODP recv not supported'
> test_odp_xrc_traffic (tests.test_odp.OdpTestCase) ... ok
> test_default_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok
> test_mem_align_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok
> test_without_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok
> test_alloc_pd (tests.test_pd.PDTest) ... ok
> test_create_pd_none_ctx (tests.test_pd.PDTest) ... ok
> test_dealloc_pd (tests.test_pd.PDTest) ... ok
> test_destroy_pd_twice (tests.test_pd.PDTest) ... ok
> test_multiple_pd_creation (tests.test_pd.PDTest) ... ok
> test_create_qp_ex_no_attr (tests.test_qp.QPTest) ... ok
> test_create_qp_ex_no_attr_connected (tests.test_qp.QPTest) ... ok
> test_create_qp_ex_with_attr (tests.test_qp.QPTest) ... ok
> test_create_qp_ex_with_attr_connected (tests.test_qp.QPTest) ... ok
> test_create_qp_no_attr (tests.test_qp.QPTest) ... ok
> test_create_qp_no_attr_connected (tests.test_qp.QPTest) ... ok
> test_create_qp_with_attr (tests.test_qp.QPTest) ... ok
> test_create_qp_with_attr_connected (tests.test_qp.QPTest) ... ok
> test_modify_qp (tests.test_qp.QPTest) ... ok
> test_query_qp (tests.test_qp.QPTest) ... ok
> test_rdmacm_sync_traffic (tests.test_rdmacm.CMTestCase) ... skipped 'No devices with net interface'
>
> ======================================================================
> FAIL: test_query_port (tests.test_device.DeviceTest)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/kernel_work/rdma-core/tests/test_device.py", line 129, in test_query_port
>     self.verify_port_attr(port_attr)
>   File "/kernel_work/rdma-core/tests/test_device.py", line 113, in verify_port_attr
>     assert 'Invalid' not in d.speed_to_str(attr.active_speed)
> AssertionError

I'm very curious how you got this assert; "d.speed_to_str" covers all
known speeds according to the IBTA.

Thanks

>
> ----------------------------------------------------------------------
> Ran 67 tests in 10.058s
>
> FAILED (failures=1, skipped=3)
>
>
> thanks,
> --
> John Hubbard
> NVIDIA
John Hubbard Dec. 20, 2019, 11:13 p.m. UTC | #11
On 12/20/19 10:48 AM, Leon Romanovsky wrote:
...
>> test_query_qp (tests.test_qp.QPTest) ... ok
>> test_rdmacm_sync_traffic (tests.test_rdmacm.CMTestCase) ... skipped 'No devices with net interface'
>>
>> ======================================================================
>> FAIL: test_query_port (tests.test_device.DeviceTest)
>> ----------------------------------------------------------------------
>> Traceback (most recent call last):
>>   File "/kernel_work/rdma-core/tests/test_device.py", line 129, in test_query_port
>>     self.verify_port_attr(port_attr)
>>   File "/kernel_work/rdma-core/tests/test_device.py", line 113, in verify_port_attr
>>     assert 'Invalid' not in d.speed_to_str(attr.active_speed)
>> AssertionError
> 
> I'm very curious how you got this assert; "d.speed_to_str" covers all
> known speeds according to the IBTA.
> 

Hi Leon,

Short answer: I can make that one pass, with a small fix to the rdma-core
test suite:

commit a1b9fb0846e1b2356d7a16f4fbdd1960cf8dcbe5 (HEAD -> fix_speed_to_str)
Author: John Hubbard <jhubbard@nvidia.com>
Date:   Fri Dec 20 15:07:47 2019 -0800

    device: fix speed_to_str(), to handle disabled links
    
    For disabled links, the raw speed token is 0. However,
    speed_to_str() doesn't have that in the list. This leads
    to an assertion when running tests (test_query_port) when
    one link is down and other link(s) are up.
    
    Fix this by returning '(Disabled/down)' for the zero speed
    case.

diff --git a/pyverbs/device.pyx b/pyverbs/device.pyx
index 33d133fd..f8b7826b 100755
--- a/pyverbs/device.pyx
+++ b/pyverbs/device.pyx
@@ -923,8 +923,8 @@ def width_to_str(width):
 
 
 def speed_to_str(speed):
-    l = {1: '2.5 Gbps', 2: '5.0 Gbps', 4: '5.0 Gbps', 8: '10.0 Gbps',
-         16: '14.0 Gbps', 32: '25.0 Gbps', 64: '50.0 Gbps'}
+    l = {0: '(Disabled/down)', 1: '2.5 Gbps', 2: '5.0 Gbps', 4: '5.0 Gbps',
+         8: '10.0 Gbps', 16: '14.0 Gbps', 32: '25.0 Gbps', 64: '50.0 Gbps'}
     try:
         return '{s} ({n})'.format(s=l[speed], n=speed)
     except KeyError:


Longer answer:
==============

It looks like this test suite assumes that every link is connected! (Probably
in most test systems, they are.) But in my setup, the ConnectX cards each have
two slots, and I only have (and only need) one cable. So one link is up, and
the other is disabled. 

This leads to the other problem, which is that if a link is disabled, the
test suite finds a "0" token for attr.active_speed. That token is not in the
approved list, and so d.speed_to_str() asserts.

With some diagnostics added, I can see it checking each link: one passes, and
the other asserts:

diff --git a/tests/test_device.py b/tests/test_device.py
index 524e0e89..7b33d7db 100644
--- a/tests/test_device.py
+++ b/tests/test_device.py
@@ -110,6 +110,12 @@ class DeviceTest(unittest.TestCase):
         assert 'Invalid' not in d.translate_mtu(attr.max_mtu)
         assert 'Invalid' not in d.translate_mtu(attr.active_mtu)
         assert 'Invalid' not in d.width_to_str(attr.active_width)
+        print("")
+        print('Diagnostics ===========================================')
+        print('phys_state:    ', d.phys_state_to_str(attr.phys_state))
+        print('active_width): ', d.width_to_str(attr.active_width))
+        print('active_speed:  ',   d.speed_to_str(attr.active_speed))
+        print('END of Diagnostics ====================================')
         assert 'Invalid' not in d.speed_to_str(attr.active_speed)
         assert 'Invalid' not in d.translate_link_layer(attr.link_layer)
         assert attr.max_msg_sz > 0x1000

...and the test run from that is:

# ./build/bin/run_tests.py --verbose tests.test_device.DeviceTest
test_dev_list (tests.test_device.DeviceTest) ... ok
test_open_dev (tests.test_device.DeviceTest) ... ok
test_query_device (tests.test_device.DeviceTest) ... ok
test_query_device_ex (tests.test_device.DeviceTest) ... ok
test_query_gid (tests.test_device.DeviceTest) ... ok
test_query_port (tests.test_device.DeviceTest) ... 
Diagnostics ===========================================
phys_state:     Link up (5)
active_width):  4X (2)
active_speed:   25.0 Gbps (32)
END of Diagnostics ====================================

Diagnostics ===========================================
phys_state:     Disabled (3)
active_width):  4X (2)
active_speed:   Invalid speed
END of Diagnostics ====================================
FAIL
test_query_port_bad_flow (tests.test_device.DeviceTest) ... ok

======================================================================
FAIL: test_query_port (tests.test_device.DeviceTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/kernel_work/rdma-core/tests/test_device.py", line 135, in test_query_port
    self.verify_port_attr(port_attr)
  File "/kernel_work/rdma-core/tests/test_device.py", line 119, in verify_port_attr
    assert 'Invalid' not in d.speed_to_str(attr.active_speed)
AssertionError

----------------------------------------------------------------------
Ran 7 tests in 0.055s

FAILED (failures=1)



thanks,
John Hubbard Dec. 20, 2019, 11:54 p.m. UTC | #12
On 12/20/19 10:29 AM, Leon Romanovsky wrote:
...
>> $ ./build.sh
>> $ build/bin/run_tests.py
>>
>> If you get things that far I think Leon can get a reproduction for you
> 
> I'm not so optimistic about that.
> 

OK, I'm going to proceed for now on the assumption that I've got an overflow
problem that happens when huge pages are pinned. If I can get more information,
great, otherwise it's probably enough.

One thing: for your repro, if you know the huge page size, and the system
page size for that case, that would really help. Also the number of pins per
page, more or less, that you'd expect. Because Jason says that only 2M huge 
pages are used...

Because the other possibility is that the refcount really is going negative, 
likely due to a mismatched pin/unpin somehow.

If there's not an obvious repro case available, but you do have one (is it easy
to repro, though?), then *if* you have the time, I could point you to a github
branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bb44c4d2ada7..8526fd03b978 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1077,7 +1077,7 @@ static inline void put_page(struct page *page)
  * get_user_pages and page_mkclean and other calls that race to set up page
  * table entries.
  */
-#define GUP_PIN_COUNTING_BIAS (1U << 10)
+#define GUP_PIN_COUNTING_BIAS (1U << 8)
 
 void unpin_user_page(struct page *page);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,

If that fails to repro, then we would be zeroing in on the root cause. 

The branch is here (I just tested it and it seems healthy):

git@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags



thanks,
John Hubbard Dec. 21, 2019, 12:02 a.m. UTC | #13
On 12/20/19 1:21 AM, Jan Kara wrote:
...
>> So, ideas and next steps:
>>
>> 1. Assuming that you *are* hitting this, I think I may have to fall back to
>> implementing the "deferred" part of this design, as part of this series, after
>> all. That means:
>>
>>   For the pin/unpin calls at least, stop treating all pages as if they are
>>   a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one page.
>>   That's not how it works now, and the need to hand back a huge array of
>>   subpages is part of the problem. This affects the callers too, so it's not
>>   a super quick change to make. (I was really hoping not to have to do this
>>   yet.)
> 
> Does that mean that you would need to make all GUP users huge page aware?
> Otherwise I don't see how what you suggest would work... And I don't think
> making all GUP users huge page aware is realistic (effort-wise) or even
> wanted (maintenance overhead in all those places).
> 

Well, pretty much yes. It's really just the pin_user_pages*() callers, but
the internals, follow_page() and such, are so interconnected right now that
it would probably blow up into a huge effort, as you point out.

> I believe there might be also a different solution for this: For
> transparent huge pages, we could find a space in 'struct page' of the
> second page in the huge page for proper pin counter and just account pins
> there so we'd have full width of 32-bits for it.
> 
> 								Honza
> 

OK, let me pursue that. Given that I shouldn't need to handle pages
splitting, it should be not *too* bad.

I am starting to think that I should just post the first 9 or so 
prerequisite patches (first 9 patches, plus the v4l2 fix that arguably should 
have been earlier in the sequence I guess), as 5.6 candidates, while I go
back to the drawing board here. 

thanks,
Dan Williams Dec. 21, 2019, 12:32 a.m. UTC | #14
On Fri, Dec 20, 2019 at 5:34 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Dec 19, 2019 at 01:13:54PM -0800, John Hubbard wrote:
> > On 12/19/19 1:07 PM, Jason Gunthorpe wrote:
> > > On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote:
> > > > On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > > > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > > > > Hi,
> > > > > >
> > > > > > This implements an API naming change (put_user_page*() -->
> > > > > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> > > > > > extends that tracking to a few select subsystems. More subsystems will
> > > > > > be added in follow up work.
> > > > >
> > > > > Hi John,
> > > > >
> > > > > The patchset generates kernel panics in our IB testing. In our tests, we
> > > > > allocated single memory block and registered multiple MRs using the single
> > > > > block.
> > > > >
> > > > > The possible bad flow is:
> > > > >    ib_umem_get() ->
> > > > >     pin_user_pages_fast(FOLL_WRITE) ->
> > > > >      internal_get_user_pages_fast(FOLL_WRITE) ->
> > > > >       gup_pgd_range() ->
> > > > >        gup_huge_pd() ->
> > > > >         gup_hugepte() ->
> > > > >          try_grab_compound_head() ->
> > > >
> > > > Hi Leon,
> > > >
> > > > Thanks very much for the detailed report! So we're overflowing...
> > > >
> > > > At first look, this seems likely to be hitting a weak point in the
> > > > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
> > > > (there's a writeup in Documentation/core-api/pin_user_pages.rst, lines
> > > > 99-121). Basically it's pretty easy to overflow the page->_refcount
> > > > with huge pages if the pages have a *lot* of subpages.
> > > >
> > > > We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
> > >
> > > Considering that establishing these pins is entirely under user
> > > control, we can't have a limit here.
> >
> > There's already a limit, it's just a much larger one. :) What does "no limit"
> > really mean, numerically, to you in this case?
>
> I guess I mean 'hidden limit' - hitting the limit and failing would
> be manageable.
>
> I think 7 is probably too low though, but we are not using 1GB huge
> pages, only 2M..

What about RDMA to 1GB-hugetlbfs and 1GB-device-dax mappings?
Dan Williams Dec. 21, 2019, 12:33 a.m. UTC | #15
On Fri, Dec 20, 2019 at 1:22 AM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 19-12-19 12:30:31, John Hubbard wrote:
> > On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > > Hi,
> > > >
> > > > This implements an API naming change (put_user_page*() -->
> > > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> > > > extends that tracking to a few select subsystems. More subsystems will
> > > > be added in follow up work.
> > >
> > > Hi John,
> > >
> > > The patchset generates kernel panics in our IB testing. In our tests, we
> > > allocated single memory block and registered multiple MRs using the single
> > > block.
> > >
> > > The possible bad flow is:
> > >   ib_umem_get() ->
> > >    pin_user_pages_fast(FOLL_WRITE) ->
> > >     internal_get_user_pages_fast(FOLL_WRITE) ->
> > >      gup_pgd_range() ->
> > >       gup_huge_pd() ->
> > >        gup_hugepte() ->
> > >         try_grab_compound_head() ->
> >
> > Hi Leon,
> >
> > Thanks very much for the detailed report! So we're overflowing...
> >
> > At first look, this seems likely to be hitting a weak point in the
> > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
> > (there's a writeup in Documentation/core-api/pin_user_pages.rst, lines
> > 99-121). Basically it's pretty easy to overflow the page->_refcount
> > with huge pages if the pages have a *lot* of subpages.
> >
> > We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
> > Do you have any idea how many pins (repeated pins on the same page, which
> > it sounds like you have) might be involved in your test case,
> > and the huge page and system page sizes? That would allow calculating
> > if we're likely overflowing for that reason.
> >
> > So, ideas and next steps:
> >
> > 1. Assuming that you *are* hitting this, I think I may have to fall back to
> > implementing the "deferred" part of this design, as part of this series, after
> > all. That means:
> >
> >   For the pin/unpin calls at least, stop treating all pages as if they are
> >   a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one page.
> >   That's not how it works now, and the need to hand back a huge array of
> >   subpages is part of the problem. This affects the callers too, so it's not
> >   a super quick change to make. (I was really hoping not to have to do this
> >   yet.)
>
> Does that mean that you would need to make all GUP users huge page aware?
> Otherwise I don't see how what you suggest would work... And I don't think
> making all GUP users huge page aware is realistic (effort-wise) or even
> wanted (maintenance overhead in all those places).
>
> I believe there might be also a different solution for this: For
> transparent huge pages, we could find a space in 'struct page' of the
> second page in the huge page for proper pin counter and just account pins
> there so we'd have full width of 32-bits for it.

That would require THP accounting for dax pages. It is something that
was probably going to be needed, but this would seem to force the
issue.
John Hubbard Dec. 21, 2019, 12:41 a.m. UTC | #16
On 12/20/19 4:33 PM, Dan Williams wrote:
...
>> I believe there might be also a different solution for this: For
>> transparent huge pages, we could find a space in 'struct page' of the
>> second page in the huge page for proper pin counter and just account pins
>> there so we'd have full width of 32-bits for it.
> 
> That would require THP accounting for dax pages. It is something that
> was probably going to be needed, but this would seem to force the
> issue.
> 

Thanks for mentioning that, it wasn't obvious to me yet. 

How easy is it for mere mortals outside of Intel to set up a DAX (nvdimm?)
test setup? I'd hate to go into this without having that coverage up
and running. It's been sketchy enough as it is. :)

thanks,
Dan Williams Dec. 21, 2019, 12:51 a.m. UTC | #17
On Fri, Dec 20, 2019 at 4:41 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 12/20/19 4:33 PM, Dan Williams wrote:
> ...
> >> I believe there might be also a different solution for this: For
> >> transparent huge pages, we could find a space in 'struct page' of the
> >> second page in the huge page for proper pin counter and just account pins
> >> there so we'd have full width of 32-bits for it.
> >
> > That would require THP accounting for dax pages. It is something that
> > was probably going to be needed, but this would seem to force the
> > issue.
> >
>
> Thanks for mentioning that, it wasn't obvious to me yet.
>
> How easy is it for mere mortals outside of Intel, to set up a DAX (nvdimm?)
> test setup? I'd hate to go into this without having that coverage up
> and running. It's been sketchy enough as it is. :)

You too can have the power of the gods for the low low price of a
kernel command line parameter, or a qemu setup.

Details here:

https://nvdimm.wiki.kernel.org/how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system
https://nvdimm.wiki.kernel.org/pmem_in_qemu
John Hubbard Dec. 21, 2019, 12:53 a.m. UTC | #18
On 12/20/19 4:51 PM, Dan Williams wrote:
> On Fri, Dec 20, 2019 at 4:41 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>
>> On 12/20/19 4:33 PM, Dan Williams wrote:
>> ...
>>>> I believe there might be also a different solution for this: For
>>>> transparent huge pages, we could find a space in 'struct page' of the
>>>> second page in the huge page for proper pin counter and just account pins
>>>> there so we'd have full width of 32-bits for it.
>>>
>>> That would require THP accounting for dax pages. It is something that
>>> was probably going to be needed, but this would seem to force the
>>> issue.
>>>
>>
>> Thanks for mentioning that, it wasn't obvious to me yet.
>>
>> How easy is it for mere mortals outside of Intel, to set up a DAX (nvdimm?)
>> test setup? I'd hate to go into this without having that coverage up
>> and running. It's been sketchy enough as it is. :)
> 
> You too can have the power of the gods for the low low price of a
> kernel command line parameter, or a qemu setup.
> 
> Details here:
> 
> https://nvdimm.wiki.kernel.org/how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system
> https://nvdimm.wiki.kernel.org/pmem_in_qemu
> 

Sweeeet! Now I can really cause some damage. :)

thanks,
Leon Romanovsky Dec. 21, 2019, 10:08 a.m. UTC | #19
On Fri, Dec 20, 2019 at 03:54:55PM -0800, John Hubbard wrote:
> On 12/20/19 10:29 AM, Leon Romanovsky wrote:
> ...
> >> $ ./build.sh
> >> $ build/bin/run_tests.py
> >>
> >> If you get things that far I think Leon can get a reproduction for you
> >
> > I'm not so optimistic about that.
> >
>
> OK, I'm going to proceed for now on the assumption that I've got an overflow
> problem that happens when huge pages are pinned. If I can get more information,
> great, otherwise it's probably enough.
>
> One thing: for your repro, if you know the huge page size, and the system
> page size for that case, that would really help. Also the number of pins per
> page, more or less, that you'd expect. Because Jason says that only 2M huge
> pages are used...
>
> Because the other possibility is that the refcount really is going negative,
> likely due to a mismatched pin/unpin somehow.
>
> If there's not an obvious repro case available, but you do have one (is it easy
> to repro, though?), then *if* you have the time, I could point you to a github
> branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this:

I'll see what I can do this Sunday.

>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bb44c4d2ada7..8526fd03b978 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1077,7 +1077,7 @@ static inline void put_page(struct page *page)
>   * get_user_pages and page_mkclean and other calls that race to set up page
>   * table entries.
>   */
> -#define GUP_PIN_COUNTING_BIAS (1U << 10)
> +#define GUP_PIN_COUNTING_BIAS (1U << 8)
>
>  void unpin_user_page(struct page *page);
>  void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>
> If that fails to repro, then we would be zeroing in on the root cause.
>
> The branch is here (I just tested it and it seems healthy):
>
> git@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags
>
>
>
> thanks,
> --
> John Hubbard
> NVIDIA
John Hubbard Dec. 21, 2019, 11:59 p.m. UTC | #20
On 12/21/19 2:08 AM, Leon Romanovsky wrote:
> On Fri, Dec 20, 2019 at 03:54:55PM -0800, John Hubbard wrote:
>> On 12/20/19 10:29 AM, Leon Romanovsky wrote:
>> ...
>>>> $ ./build.sh
>>>> $ build/bin/run_tests.py
>>>>
>>>> If you get things that far I think Leon can get a reproduction for you
>>>
>>> I'm not so optimistic about that.
>>>
>>
>> OK, I'm going to proceed for now on the assumption that I've got an overflow
>> problem that happens when huge pages are pinned. If I can get more information,
>> great, otherwise it's probably enough.
>>
>> One thing: for your repro, if you know the huge page size, and the system
>> page size for that case, that would really help. Also the number of pins per
>> page, more or less, that you'd expect. Because Jason says that only 2M huge
>> pages are used...
>>
>> Because the other possibility is that the refcount really is going negative,
>> likely due to a mismatched pin/unpin somehow.
>>
>> If there's not an obvious repro case available, but you do have one (is it easy
>> to repro, though?), then *if* you have the time, I could point you to a github
>> branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this:
> 
> I'll see what I can do this Sunday.
> 

The other data point that might shed light on whether it's a mismatch (this only
works if the system is not actually crashing, though) is checking the new
vmstat items, like this:

$ grep foll_pin /proc/vmstat
nr_foll_pin_requested 16288188
nr_foll_pin_returned 16288188

...but OTOH, if you've got long-term pins, then those are *supposed* to be
mismatched, so it only really helps in between tests.

thanks,
Leon Romanovsky Dec. 22, 2019, 1:23 p.m. UTC | #21
On Fri, Dec 20, 2019 at 03:54:55PM -0800, John Hubbard wrote:
> On 12/20/19 10:29 AM, Leon Romanovsky wrote:
> ...
> >> $ ./build.sh
> >> $ build/bin/run_tests.py
> >>
> >> If you get things that far I think Leon can get a reproduction for you
> >
> > I'm not so optimistic about that.
> >
>
> OK, I'm going to proceed for now on the assumption that I've got an overflow
> problem that happens when huge pages are pinned. If I can get more information,
> great, otherwise it's probably enough.
>
> One thing: for your repro, if you know the huge page size, and the system
> page size for that case, that would really help. Also the number of pins per
> page, more or less, that you'd expect. Because Jason says that only 2M huge
> pages are used...
>
> Because the other possibility is that the refcount really is going negative,
> likely due to a mismatched pin/unpin somehow.
>
> If there's not an obvious repro case available, but you do have one (is it easy
> to repro, though?), then *if* you have the time, I could point you to a github
> branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this:
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bb44c4d2ada7..8526fd03b978 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1077,7 +1077,7 @@ static inline void put_page(struct page *page)
>   * get_user_pages and page_mkclean and other calls that race to set up page
>   * table entries.
>   */
> -#define GUP_PIN_COUNTING_BIAS (1U << 10)
> +#define GUP_PIN_COUNTING_BIAS (1U << 8)
>
>  void unpin_user_page(struct page *page);
>  void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>
> If that fails to repro, then we would be zeroing in on the root cause.
>
> The branch is here (I just tested it and it seems healthy):
>
> git@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags

Hi,

We tested the following branch and here comes results:
[root@server consume_mtts]# (master) $ grep foll_pin /proc/vmstat
nr_foll_pin_requested 0
nr_foll_pin_returned 0

[root@serer consume_mtts]# (master) $ dmesg
[  425.221459] ------------[ cut here ]------------
[  425.225894] WARNING: CPU: 1 PID: 6738 at mm/gup.c:61 try_grab_compound_head+0x90/0xa0
[  425.228021] Modules linked in: mlx5_ib mlx5_core mlxfw mlx4_ib mlx4_en ptp pps_core mlx4_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre ip_tunnel rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm ib_uverbs ib_ipoib ib_umad ib_srp scsi_transport_srp rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core [last unloaded: mlxfw]
[  425.235266] CPU: 1 PID: 6738 Comm: consume_mtts Tainted: G           O      5.5.0-rc2+ #1
[  425.237480] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[  425.239738] RIP: 0010:try_grab_compound_head+0x90/0xa0
[  425.241170] Code: 06 48 8d 4f 34 f0 0f b1 57 34 74 cd 85 c0 74 cf 8d 14 06 f0 0f b1 11 74 c0 eb f1 8d 14 06 f0 0f b1 11 74 b5 85 c0 75 f3 eb b5 <0f> 0b 31 c0 c3 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41
[  425.245739] RSP: 0018:ffffc900006878a8 EFLAGS: 00010082
[  425.247124] RAX: 0000000080000001 RBX: 00007f780488a000 RCX: 0000000000000bb0
[  425.248956] RDX: ffffea000e031087 RSI: 0000000000008a00 RDI: ffffea000dc58000
[  425.250761] RBP: ffffea000e031080 R08: ffffc90000687974 R09: 000fffffffe00000
[  425.252661] R10: 0000000000000000 R11: ffff888362560000 R12: 000000000000008a
[  425.254487] R13: 80000003716000e7 R14: 00007f780488a000 R15: ffffc90000687974
[  425.256309] FS:  00007f780d9d3740(0000) GS:ffff8883b1c80000(0000) knlGS:0000000000000000
[  425.258401] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  425.259949] CR2: 0000000002334048 CR3: 000000039c68c001 CR4: 00000000001606a0
[  425.261884] Call Trace:
[  425.262735]  gup_pgd_range+0x517/0x5a0
[  425.263819]  internal_get_user_pages_fast+0x210/0x250
[  425.265193]  ib_umem_get+0x298/0x550 [ib_uverbs]
[  425.266476]  mr_umem_get+0xc9/0x260 [mlx5_ib]
[  425.267699]  mlx5_ib_reg_user_mr+0xcc/0x7e0 [mlx5_ib]
[  425.269134]  ? xas_load+0x8/0x80
[  425.270074]  ? xa_load+0x48/0x90
[  425.271038]  ? lookup_get_idr_uobject.part.10+0x12/0x70 [ib_uverbs]
[  425.272757]  ib_uverbs_reg_mr+0x127/0x280 [ib_uverbs]
[  425.274120]  ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0xf0 [ib_uverbs]
[  425.276058]  ib_uverbs_cmd_verbs.isra.6+0x5be/0xbe0 [ib_uverbs]
[  425.277657]  ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
[  425.279155]  ? __alloc_pages_nodemask+0x148/0x2b0
[  425.280445]  ib_uverbs_ioctl+0xc0/0x120 [ib_uverbs]
[  425.281755]  do_vfs_ioctl+0x9d/0x650
[  425.282766]  ksys_ioctl+0x70/0x80
[  425.283745]  __x64_sys_ioctl+0x16/0x20
[  425.284912]  do_syscall_64+0x42/0x130
[  425.285973]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  425.287377] RIP: 0033:0x7f780d2df267
[  425.288449] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48
[  425.293073] RSP: 002b:00007ffce49a88a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  425.295034] RAX: ffffffffffffffda RBX: 00007ffce49a8938 RCX: 00007f780d2df267
[  425.296895] RDX: 00007ffce49a8920 RSI: 00000000c0181b01 RDI: 0000000000000003
[  425.298689] RBP: 00007ffce49a8900 R08: 0000000000000003 R09: 00007f780d9a1010
[  425.300480] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f780d9a1150
[  425.302290] R13: 00007ffce49a8900 R14: 00007ffce49a8ad8 R15: 00007f780468a000
[  425.304113] ---[ end trace 1ecbefdb403190dd ]---
[  425.305434] ------------[ cut here ]------------
[  425.307147] WARNING: CPU: 1 PID: 6738 at mm/gup.c:150 try_grab_page+0x56/0x60
[  425.309111] Modules linked in: mlx5_ib mlx5_core mlxfw mlx4_ib mlx4_en ptp pps_core mlx4_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre ip_tunnel rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm ib_uverbs ib_ipoib ib_umad ib_srp scsi_transport_srp rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core [last unloaded: mlxfw]
[  425.316461] CPU: 1 PID: 6738 Comm: consume_mtts Tainted: G        W  O      5.5.0-rc2+ #1
[  425.318582] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[  425.320958] RIP: 0010:try_grab_page+0x56/0x60
[  425.322167] Code: 7e 28 f0 81 47 34 00 01 00 00 c3 48 8b 47 08 48 8d 50 ff a8 01 48 0f 45 fa 8b 47 34 85 c0 7e 0f f0 ff 47 34 b8 01 00 00 00 c3 <0f> 0b 31 c0 c3 0f 0b 31 c0 c3 0f 1f 44 00 00 41 57 41 56 41 55 41
[  425.326814] RSP: 0018:ffffc90000687830 EFLAGS: 00010282
[  425.328226] RAX: 0000000000000001 RBX: ffffea000dc58000 RCX: ffffea000e031087
[  425.330104] RDX: 0000000080000001 RSI: 0000000000040000 RDI: ffffea000dc58000
[  425.331980] RBP: 00007f7804800000 R08: 000ffffffffff000 R09: 80000003716000e7
[  425.333898] R10: ffff88834af80120 R11: ffff8883ac16f000 R12: ffff88834af80120
[  425.335704] R13: ffff88837c0915c0 R14: 0000000000050201 R15: 00007f7804800000
[  425.337638] FS:  00007f780d9d3740(0000) GS:ffff8883b1c80000(0000) knlGS:0000000000000000
[  425.339734] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  425.341369] CR2: 0000000002334048 CR3: 000000039c68c001 CR4: 00000000001606a0
[  425.343160] Call Trace:
[  425.343967]  follow_trans_huge_pmd+0x16f/0x2e0
[  425.345263]  follow_p4d_mask+0x51c/0x630
[  425.346344]  __get_user_pages+0x1a1/0x6c0
[  425.347463]  internal_get_user_pages_fast+0x17b/0x250
[  425.348918]  ib_umem_get+0x298/0x550 [ib_uverbs]
[  425.350174]  mr_umem_get+0xc9/0x260 [mlx5_ib]
[  425.351383]  mlx5_ib_reg_user_mr+0xcc/0x7e0 [mlx5_ib]
[  425.352849]  ? xas_load+0x8/0x80
[  425.353776]  ? xa_load+0x48/0x90
[  425.354730]  ? lookup_get_idr_uobject.part.10+0x12/0x70 [ib_uverbs]
[  425.356410]  ib_uverbs_reg_mr+0x127/0x280 [ib_uverbs]
[  425.357843]  ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0xf0 [ib_uverbs]
[  425.359749]  ib_uverbs_cmd_verbs.isra.6+0x5be/0xbe0 [ib_uverbs]
[  425.361405]  ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
[  425.362898]  ? __alloc_pages_nodemask+0x148/0x2b0
[  425.364206]  ib_uverbs_ioctl+0xc0/0x120 [ib_uverbs]
[  425.365564]  do_vfs_ioctl+0x9d/0x650
[  425.366567]  ksys_ioctl+0x70/0x80
[  425.367537]  __x64_sys_ioctl+0x16/0x20
[  425.368698]  do_syscall_64+0x42/0x130
[  425.369782]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  425.371117] RIP: 0033:0x7f780d2df267
[  425.372159] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48
[  425.376774] RSP: 002b:00007ffce49a88a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  425.378740] RAX: ffffffffffffffda RBX: 00007ffce49a8938 RCX: 00007f780d2df267
[  425.380598] RDX: 00007ffce49a8920 RSI: 00000000c0181b01 RDI: 0000000000000003
[  425.382411] RBP: 00007ffce49a8900 R08: 0000000000000003 R09: 00007f780d9a1010
[  425.384312] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f780d9a1150
[  425.386132] R13: 00007ffce49a8900 R14: 00007ffce49a8ad8 R15: 00007f780468a000
[  425.387964] ---[ end trace 1ecbefdb403190de ]---

Thanks

>
>
>
> thanks,
> --
> John Hubbard
> NVIDIA
Jason Gunthorpe Dec. 23, 2019, 6:24 p.m. UTC | #22
On Fri, Dec 20, 2019 at 04:32:13PM -0800, Dan Williams wrote:

> > > There's already a limit, it's just a much larger one. :) What does "no limit"
> > > really mean, numerically, to you in this case?
> >
> > I guess I mean 'hidden limit' - hitting the limit and failing would
> > be manageable.
> >
> > I think 7 is probably too low though, but we are not using 1GB huge
> > pages, only 2M..
> 
> What about RDMA to 1GB-hugetlbfs and 1GB-device-dax mappings?

I don't think the failing testing is doing that.

It is also less likely that 1GB regions will need multi-mapping, IMHO.

Jason
John Hubbard Dec. 25, 2019, 2:03 a.m. UTC | #23
On 12/22/19 5:23 AM, Leon Romanovsky wrote:
> On Fri, Dec 20, 2019 at 03:54:55PM -0800, John Hubbard wrote:
>> On 12/20/19 10:29 AM, Leon Romanovsky wrote:
>> ...
>>>> $ ./build.sh
>>>> $ build/bin/run_tests.py
>>>>
>>>> If you get things that far I think Leon can get a reproduction for you
>>>
>>> I'm not so optimistic about that.
>>>
>>
>> OK, I'm going to proceed for now on the assumption that I've got an overflow
>> problem that happens when huge pages are pinned. If I can get more information,
>> great, otherwise it's probably enough.
>>
>> One thing: for your repro, if you know the huge page size, and the system
>> page size for that case, that would really help. Also the number of pins per
>> page, more or less, that you'd expect. Because Jason says that only 2M huge
>> pages are used...
>>
>> Because the other possibility is that the refcount really is going negative,
>> likely due to a mismatched pin/unpin somehow.
>>
>> If there's not an obvious repro case available, but you do have one (is it easy
>> to repro, though?), then *if* you have the time, I could point you to a github
>> branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this:
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index bb44c4d2ada7..8526fd03b978 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1077,7 +1077,7 @@ static inline void put_page(struct page *page)
>>    * get_user_pages and page_mkclean and other calls that race to set up page
>>    * table entries.
>>    */
>> -#define GUP_PIN_COUNTING_BIAS (1U << 10)
>> +#define GUP_PIN_COUNTING_BIAS (1U << 8)
>>
>>   void unpin_user_page(struct page *page);
>>   void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>>
>> If that fails to repro, then we would be zeroing in on the root cause.
>>
>> The branch is here (I just tested it and it seems healthy):
>>
>> git@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags
> 
> Hi,
> 
> We tested the following branch and here comes results:

Thanks for this testing run!

> [root@server consume_mtts]# (master) $ grep foll_pin /proc/vmstat
> nr_foll_pin_requested 0
> nr_foll_pin_returned 0
> 

Zero pinned pages!

...now I'm confused. Somehow FOLL_PIN and pin_user_pages*() calls are
not happening. And although the backtraces below show some of my new
routines (like try_grab_page), they also confirm the above: there is no
pin_user_page*() call in the stack.

In particular, it looks like ib_umem_get() is calling through to
get_user_pages*(), rather than pin_user_pages*(). I don't see how this
is possible, because the code on my screen shows ib_umem_get() calling
pin_user_pages_fast().

Any thoughts or ideas are welcome here.

However, glossing over all of that and assuming that the new
GUP_PIN_COUNTING_BIAS of 256 is applied, it's interesting that we still
see any overflow. I'm less confident now that this is a true refcount
overflow.

Also, any information that would get me closer to being able to attempt
my own reproduction of the problem is *very* welcome. :)

thanks,
Leon Romanovsky Dec. 25, 2019, 5:26 a.m. UTC | #24
On Tue, Dec 24, 2019 at 06:03:50PM -0800, John Hubbard wrote:
> On 12/22/19 5:23 AM, Leon Romanovsky wrote:
> > On Fri, Dec 20, 2019 at 03:54:55PM -0800, John Hubbard wrote:
> > > On 12/20/19 10:29 AM, Leon Romanovsky wrote:
> > > ...
> > > > > $ ./build.sh
> > > > > $ build/bin/run_tests.py
> > > > >
> > > > > If you get things that far I think Leon can get a reproduction for you
> > > >
> > > > I'm not so optimistic about that.
> > > >
> > >
> > > OK, I'm going to proceed for now on the assumption that I've got an overflow
> > > problem that happens when huge pages are pinned. If I can get more information,
> > > great, otherwise it's probably enough.
> > >
> > > One thing: for your repro, if you know the huge page size, and the system
> > > page size for that case, that would really help. Also the number of pins per
> > > page, more or less, that you'd expect. Because Jason says that only 2M huge
> > > pages are used...
> > >
> > > Because the other possibility is that the refcount really is going negative,
> > > likely due to a mismatched pin/unpin somehow.
> > >
> > > If there's not an obvious repro case available, but you do have one (is it easy
> > > to repro, though?), then *if* you have the time, I could point you to a github
> > > branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this:
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index bb44c4d2ada7..8526fd03b978 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -1077,7 +1077,7 @@ static inline void put_page(struct page *page)
> > >    * get_user_pages and page_mkclean and other calls that race to set up page
> > >    * table entries.
> > >    */
> > > -#define GUP_PIN_COUNTING_BIAS (1U << 10)
> > > +#define GUP_PIN_COUNTING_BIAS (1U << 8)
> > >
> > >   void unpin_user_page(struct page *page);
> > >   void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
> > >
> > > If that fails to repro, then we would be zeroing in on the root cause.
> > >
> > > The branch is here (I just tested it and it seems healthy):
> > >
> > > git@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags
> >
> > Hi,
> >
> > We tested the following branch and here comes results:
>
> Thanks for this testing run!
>
> > [root@server consume_mtts]# (master) $ grep foll_pin /proc/vmstat
> > nr_foll_pin_requested 0
> > nr_foll_pin_returned 0
> >
>
> Zero pinned pages!

Maybe we are missing some CONFIG_* option?
https://lore.kernel.org/linux-rdma/12a28917-f8c9-5092-2f01-92bb74714cae@nvidia.com/T/#mf900896f5dfc86cdee9246219990c632ed77115f

>
> ...now I'm confused. Somehow FOLL_PIN and pin_user_pages*() calls are
> not happening. And although the backtraces below show some of my new
> routines (like try_grab_page), they also confirm the above: there is no
> pin_user_page*() call in the stack.
>
> In particular, it looks like ib_umem_get() is calling through to
> get_user_pages*(), rather than pin_user_pages*(). I don't see how this
> is possible, because the code on my screen shows ib_umem_get() calling
> pin_user_pages_fast().
>
> Any thoughts or ideas are welcome here.
>
> However, glossing over all of that and assuming that the new
> GUP_PIN_COUNTING_BIAS of 256 is applied, it's interesting that we still
> see any overflow. I'm less confident now that this is a true refcount
> overflow.

Earlier in this email thread, I posted possible function call chain which
doesn't involve refcount overflow, but for some reason the refcount
overflow was chosen as a way to explore.

>
> Also, any information that would get me closer to being able to attempt
> my own reproduction of the problem are *very* welcome. :)

It is an ancient verification test (~10 years old), and making it
understandable and standalone is not an easy task. :)

>
> thanks,
> --
> John Hubbard
> NVIDIA
>
> > [root@serer consume_mtts]# (master) $ dmesg
> > [  425.221459] ------------[ cut here ]------------
> > [  425.225894] WARNING: CPU: 1 PID: 6738 at mm/gup.c:61 try_grab_compound_head+0x90/0xa0
> > [  425.228021] Modules linked in: mlx5_ib mlx5_core mlxfw mlx4_ib mlx4_en ptp pps_core mlx4_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre ip_tunnel rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm ib_uverbs ib_ipoib ib_umad ib_srp scsi_transport_srp rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core [last unloaded: mlxfw]
> > [  425.235266] CPU: 1 PID: 6738 Comm: consume_mtts Tainted: G           O      5.5.0-rc2+ #1
> > [  425.237480] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
> > [  425.239738] RIP: 0010:try_grab_compound_head+0x90/0xa0
> > [  425.241170] Code: 06 48 8d 4f 34 f0 0f b1 57 34 74 cd 85 c0 74 cf 8d 14 06 f0 0f b1 11 74 c0 eb f1 8d 14 06 f0 0f b1 11 74 b5 85 c0 75 f3 eb b5 <0f> 0b 31 c0 c3 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41
> > [  425.245739] RSP: 0018:ffffc900006878a8 EFLAGS: 00010082
> > [  425.247124] RAX: 0000000080000001 RBX: 00007f780488a000 RCX: 0000000000000bb0
> > [  425.248956] RDX: ffffea000e031087 RSI: 0000000000008a00 RDI: ffffea000dc58000
> > [  425.250761] RBP: ffffea000e031080 R08: ffffc90000687974 R09: 000fffffffe00000
> > [  425.252661] R10: 0000000000000000 R11: ffff888362560000 R12: 000000000000008a
> > [  425.254487] R13: 80000003716000e7 R14: 00007f780488a000 R15: ffffc90000687974
> > [  425.256309] FS:  00007f780d9d3740(0000) GS:ffff8883b1c80000(0000) knlGS:0000000000000000
> > [  425.258401] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  425.259949] CR2: 0000000002334048 CR3: 000000039c68c001 CR4: 00000000001606a0
> > [  425.261884] Call Trace:
> > [  425.262735]  gup_pgd_range+0x517/0x5a0
> > [  425.263819]  internal_get_user_pages_fast+0x210/0x250
> > [  425.265193]  ib_umem_get+0x298/0x550 [ib_uverbs]
> > [  425.266476]  mr_umem_get+0xc9/0x260 [mlx5_ib]
> > [  425.267699]  mlx5_ib_reg_user_mr+0xcc/0x7e0 [mlx5_ib]
> > [  425.269134]  ? xas_load+0x8/0x80
> > [  425.270074]  ? xa_load+0x48/0x90
> > [  425.271038]  ? lookup_get_idr_uobject.part.10+0x12/0x70 [ib_uverbs]
> > [  425.272757]  ib_uverbs_reg_mr+0x127/0x280 [ib_uverbs]
> > [  425.274120]  ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0xf0 [ib_uverbs]
> > [  425.276058]  ib_uverbs_cmd_verbs.isra.6+0x5be/0xbe0 [ib_uverbs]
> > [  425.277657]  ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
> > [  425.279155]  ? __alloc_pages_nodemask+0x148/0x2b0
> > [  425.280445]  ib_uverbs_ioctl+0xc0/0x120 [ib_uverbs]
> > [  425.281755]  do_vfs_ioctl+0x9d/0x650
> > [  425.282766]  ksys_ioctl+0x70/0x80
> > [  425.283745]  __x64_sys_ioctl+0x16/0x20
> > [  425.284912]  do_syscall_64+0x42/0x130
> > [  425.285973]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [  425.287377] RIP: 0033:0x7f780d2df267
> > [  425.288449] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48
> > [  425.293073] RSP: 002b:00007ffce49a88a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> > [  425.295034] RAX: ffffffffffffffda RBX: 00007ffce49a8938 RCX: 00007f780d2df267
> > [  425.296895] RDX: 00007ffce49a8920 RSI: 00000000c0181b01 RDI: 0000000000000003
> > [  425.298689] RBP: 00007ffce49a8900 R08: 0000000000000003 R09: 00007f780d9a1010
> > [  425.300480] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f780d9a1150
> > [  425.302290] R13: 00007ffce49a8900 R14: 00007ffce49a8ad8 R15: 00007f780468a000
> > [  425.304113] ---[ end trace 1ecbefdb403190dd ]---
> > [  425.305434] ------------[ cut here ]------------
> > [  425.307147] WARNING: CPU: 1 PID: 6738 at mm/gup.c:150 try_grab_page+0x56/0x60
> > [  425.309111] Modules linked in: mlx5_ib mlx5_core mlxfw mlx4_ib mlx4_en ptp pps_core mlx4_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre ip_tunnel rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm ib_uverbs ib_ipoib ib_umad ib_srp scsi_transport_srp rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core [last unloaded: mlxfw]
> > [  425.316461] CPU: 1 PID: 6738 Comm: consume_mtts Tainted: G        W  O      5.5.0-rc2+ #1
> > [  425.318582] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
> > [  425.320958] RIP: 0010:try_grab_page+0x56/0x60
> > [  425.322167] Code: 7e 28 f0 81 47 34 00 01 00 00 c3 48 8b 47 08 48 8d 50 ff a8 01 48 0f 45 fa 8b 47 34 85 c0 7e 0f f0 ff 47 34 b8 01 00 00 00 c3 <0f> 0b 31 c0 c3 0f 0b 31 c0 c3 0f 1f 44 00 00 41 57 41 56 41 55 41
> > [  425.326814] RSP: 0018:ffffc90000687830 EFLAGS: 00010282
> > [  425.328226] RAX: 0000000000000001 RBX: ffffea000dc58000 RCX: ffffea000e031087
> > [  425.330104] RDX: 0000000080000001 RSI: 0000000000040000 RDI: ffffea000dc58000
> > [  425.331980] RBP: 00007f7804800000 R08: 000ffffffffff000 R09: 80000003716000e7
> > [  425.333898] R10: ffff88834af80120 R11: ffff8883ac16f000 R12: ffff88834af80120
> > [  425.335704] R13: ffff88837c0915c0 R14: 0000000000050201 R15: 00007f7804800000
> > [  425.337638] FS:  00007f780d9d3740(0000) GS:ffff8883b1c80000(0000) knlGS:0000000000000000
> > [  425.339734] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  425.341369] CR2: 0000000002334048 CR3: 000000039c68c001 CR4: 00000000001606a0
> > [  425.343160] Call Trace:
> > [  425.343967]  follow_trans_huge_pmd+0x16f/0x2e0
> > [  425.345263]  follow_p4d_mask+0x51c/0x630
> > [  425.346344]  __get_user_pages+0x1a1/0x6c0
> > [  425.347463]  internal_get_user_pages_fast+0x17b/0x250
> > [  425.348918]  ib_umem_get+0x298/0x550 [ib_uverbs]
> > [  425.350174]  mr_umem_get+0xc9/0x260 [mlx5_ib]
> > [  425.351383]  mlx5_ib_reg_user_mr+0xcc/0x7e0 [mlx5_ib]
> > [  425.352849]  ? xas_load+0x8/0x80
> > [  425.353776]  ? xa_load+0x48/0x90
> > [  425.354730]  ? lookup_get_idr_uobject.part.10+0x12/0x70 [ib_uverbs]
> > [  425.356410]  ib_uverbs_reg_mr+0x127/0x280 [ib_uverbs]
> > [  425.357843]  ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0xf0 [ib_uverbs]
> > [  425.359749]  ib_uverbs_cmd_verbs.isra.6+0x5be/0xbe0 [ib_uverbs]
> > [  425.361405]  ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
> > [  425.362898]  ? __alloc_pages_nodemask+0x148/0x2b0
> > [  425.364206]  ib_uverbs_ioctl+0xc0/0x120 [ib_uverbs]
> > [  425.365564]  do_vfs_ioctl+0x9d/0x650
> > [  425.366567]  ksys_ioctl+0x70/0x80
> > [  425.367537]  __x64_sys_ioctl+0x16/0x20
> > [  425.368698]  do_syscall_64+0x42/0x130
> > [  425.369782]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [  425.371117] RIP: 0033:0x7f780d2df267
> > [  425.372159] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48
> > [  425.376774] RSP: 002b:00007ffce49a88a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> > [  425.378740] RAX: ffffffffffffffda RBX: 00007ffce49a8938 RCX: 00007f780d2df267
> > [  425.380598] RDX: 00007ffce49a8920 RSI: 00000000c0181b01 RDI: 0000000000000003
> > [  425.382411] RBP: 00007ffce49a8900 R08: 0000000000000003 R09: 00007f780d9a1010
> > [  425.384312] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f780d9a1150
> > [  425.386132] R13: 00007ffce49a8900 R14: 00007ffce49a8ad8 R15: 00007f780468a000
> > [  425.387964] ---[ end trace 1ecbefdb403190de ]---
> >
> > Thanks
> >
> > >
> > >
> > >
> > > thanks,
> > > --
> > > John Hubbard
> > > NVIDIA
John Hubbard Dec. 27, 2019, 9:56 p.m. UTC | #25
On 12/24/19 9:26 PM, Leon Romanovsky wrote:
...
>>>> The branch is here (I just tested it and it seems healthy):
>>>>
>>>> git@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags
>>>
>>> Hi,
>>>
>>> We tested the following branch and here comes results:
>>
>> Thanks for this testing run!
>>
>>> [root@server consume_mtts]# (master) $ grep foll_pin /proc/vmstat
>>> nr_foll_pin_requested 0
>>> nr_foll_pin_returned 0
>>>
>>
>> Zero pinned pages!
> 
> Maybe we are missing some CONFIG_* option?
> https://lore.kernel.org/linux-rdma/12a28917-f8c9-5092-2f01-92bb74714cae@nvidia.com/T/#mf900896f5dfc86cdee9246219990c632ed77115f


Ah OK, it must be that CONFIG_DEBUG_VM is not set, thanks!


>> ...now I'm confused. Somehow FOLL_PIN and pin_user_pages*() calls are
>> not happening. And although the backtraces below show some of my new
>> routines (like try_grab_page), they also confirm the above: there is no
>> pin_user_page*() call in the stack.
>>
>> In particular, it looks like ib_umem_get() is calling through to
>> get_user_pages*(), rather than pin_user_pages*(). I don't see how this
>> is possible, because the code on my screen shows ib_umem_get() calling
>> pin_user_pages_fast().
>>

It must be that pin_user_pages() is in the call stack, but just not getting
printed. There's no other way to explain this.

>> Any thoughts or ideas are welcome here.
>>
>> However, glossing over all of that and assuming that the new
>> GUP_PIN_COUNTING_BIAS of 256 is applied, it's interesting that we still
>> see any overflow. I'm less confident now that this is a true refcount
>> overflow.
> 
> Earlier in this email thread, I posted possible function call chain which
> doesn't involve refcount overflow, but for some reason the refcount
> overflow was chosen as a way to explore.
> 

Both of the WARN() calls are asserting that the refcount went negative (one
asserts strictly negative, and the other asserts "<= 0"). So it's pretty hard
to talk our way out of a refcount overflow here.
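
For context, the checks in question are essentially of this shape
(paraphrasing the series from memory, not quoting it verbatim):

/*
 * Paraphrased sketch of the refcount sanity checks behind the two WARNs in
 * the dmesg above (mm/gup.c:61 and mm/gup.c:150 in this branch); not the
 * exact code from the series.
 */
static inline bool try_grab_sketch(struct page *page, int refs)
{
	/* Refuse to pin if the refcount is already zero or negative: */
	if (WARN_ON_ONCE(page_ref_count(page) <= 0))
		return false;

	/* Otherwise, a FOLL_PIN grab bumps the count by the pinning bias: */
	page_ref_add(page, refs * GUP_PIN_COUNTING_BIAS);
	return true;
}

In other words, for the WARN to fire, _refcount has already gone to zero or
below by the time we try to grab the page.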

>>
>> Also, any information that would get me closer to being able to attempt
>> my own reproduction of the problem are *very* welcome. :)
> 
> It is ancient verification test (~10y) which is not an easy task to
> make it understandable and standalone :).
> 

Is this the only test that fails, btw? No other test failures or hints of
problems?

(Also, maybe hopeless, but can *anyone* on the RDMA list provide some
characterization of the test, such as how many pins per page, what page
sizes are used? I'm still hoping to write a test to trigger something
close to this...)
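
In case it helps anyone sanity-check my assumptions, here is roughly the kind
of standalone repro I have in mind. This is completely hypothetical (I don't
know what consume_mtts actually does): mmap one 2MB huge page and register a
large number of MRs over the same buffer, so that every registration pins all
512 subpages of the same compound page.

/*
 * Hypothetical repro sketch, not the actual test. Requires a free 2MB
 * hugetlb page (vm.nr_hugepages > 0) and an RDMA device.
 * Build with: gcc repro.c -o repro -libverbs
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <infiniband/verbs.h>

#define NR_MRS   8192		/* enough to overflow a 2MB compound head at bias 1024 */
#define BUF_LEN  (2UL << 20)	/* one 2MB huge page */

int main(void)
{
	struct ibv_device **devs = ibv_get_device_list(NULL);
	struct ibv_context *ctx = (devs && devs[0]) ? ibv_open_device(devs[0]) : NULL;
	struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
	static struct ibv_mr *mrs[NR_MRS];
	void *buf;
	int i;

	if (!pd) {
		fprintf(stderr, "no RDMA device / PD available\n");
		return 1;
	}

	buf = mmap(NULL, BUF_LEN, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");
		return 1;
	}

	/* Each registration pins all 512 subpages of the same huge page. */
	for (i = 0; i < NR_MRS; i++) {
		mrs[i] = ibv_reg_mr(pd, buf, BUF_LEN, IBV_ACCESS_LOCAL_WRITE);
		if (!mrs[i]) {
			perror("ibv_reg_mr");
			break;
		}
	}
	printf("registered %d MRs over the same 2MB buffer\n", i);

	while (i-- > 0)
		ibv_dereg_mr(mrs[i]);
	return 0;
}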

I do have a couple more ideas for test runs:

1. Reduce GUP_PIN_COUNTING_BIAS to 1. That would turn the whole override of
page->_refcount into a no-op, and so if all is well (it may not be!) with the
rest of the patch, then we'd expect this problem to not reappear.

2. Activate /proc/vmstat *foll_pin* statistics unconditionally (just for these
tests, of course), so we can see if there is a get/put mismatch. However, that
will change the timing, and so it must be attempted independently of (1), in
order to see if it ends up hiding the repro.

I've updated this branch to implement (1), but not (2), hoping you can give
this one a spin?

     git@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags


thanks,
John Hubbard Dec. 29, 2019, 4:33 a.m. UTC | #26
On 12/27/19 1:56 PM, John Hubbard wrote:
...
>> It is ancient verification test (~10y) which is not an easy task to
>> make it understandable and standalone :).
>>
> 
> Is this the only test that fails, btw? No other test failures or hints of
> problems?
> 
> (Also, maybe hopeless, but can *anyone* on the RDMA list provide some
> characterization of the test, such as how many pins per page, what page
> sizes are used? I'm still hoping to write a test to trigger something
> close to this...)
> 
> I do have a couple more ideas for test runs:
> 
> 1. Reduce GUP_PIN_COUNTING_BIAS to 1. That would turn the whole override of
> page->_refcount into a no-op, and so if all is well (it may not be!) with the
> rest of the patch, then we'd expect this problem to not reappear.
> 
> 2. Active /proc/vmstat *foll_pin* statistics unconditionally (just for these
> tests, of course), so we can see if there is a get/put mismatch. However, that
> will change the timing, and so it must be attempted independently of (1), in
> order to see if it ends up hiding the repro.
> 
> I've updated this branch to implement (1), but not (2), hoping you can give
> this one a spin?
> 
>     git@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags
> 
> 

Also, looking ahead:

a) if the problem disappears with the latest test above, then we likely have
   a huge page refcount overflow, and there are a couple of different ways to
   fix it. 

b) if it still reproduces with the above, then it's some other random mistake,
   and in that case I'd be inclined to do a sort of guided (or classic, unguided)
   git bisect of the series. Because it could be any of several patches.

   If that's too much trouble, then I'd have to fall back to submitting a few
   patches at a time and working my way up to the tracking patch...


thanks,
Jan Kara Jan. 6, 2020, 9:01 a.m. UTC | #27
On Sat 28-12-19 20:33:32, John Hubbard wrote:
> On 12/27/19 1:56 PM, John Hubbard wrote:
> ...
> >> It is ancient verification test (~10y) which is not an easy task to
> >> make it understandable and standalone :).
> >>
> > 
> > Is this the only test that fails, btw? No other test failures or hints of
> > problems?
> > 
> > (Also, maybe hopeless, but can *anyone* on the RDMA list provide some
> > characterization of the test, such as how many pins per page, what page
> > sizes are used? I'm still hoping to write a test to trigger something
> > close to this...)
> > 
> > I do have a couple more ideas for test runs:
> > 
> > 1. Reduce GUP_PIN_COUNTING_BIAS to 1. That would turn the whole override of
> > page->_refcount into a no-op, and so if all is well (it may not be!) with the
> > rest of the patch, then we'd expect this problem to not reappear.
> > 
> > 2. Active /proc/vmstat *foll_pin* statistics unconditionally (just for these
> > tests, of course), so we can see if there is a get/put mismatch. However, that
> > will change the timing, and so it must be attempted independently of (1), in
> > order to see if it ends up hiding the repro.
> > 
> > I've updated this branch to implement (1), but not (2), hoping you can give
> > this one a spin?
> > 
> >     git@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags
> > 
> > 
> 
> Also, looking ahead:
> 
> a) if the problem disappears with the latest above test, then we likely have
>    a huge page refcount overflow, and there are a couple of different ways to
>    fix it. 
> 
> b) if it still reproduces with the above, then it's some other random mistake,
>    and in that case I'd be inclined to do a sort of guided (or classic, unguided)
>    git bisect of the series. Because it could be any of several patches.
> 
>    If that's too much trouble, then I'd have to fall back to submitting a few
>    patches at a time and working my way up to the tracking patch...

It could also be that an ordinary page reference is dropped with 'unpin',
thus underflowing the page refcount...

								Honza
John Hubbard Jan. 7, 2020, 1:26 a.m. UTC | #28
On 1/6/20 1:01 AM, Jan Kara wrote:
...
>> Also, looking ahead:
>>
>> a) if the problem disappears with the latest above test, then we likely have
>>     a huge page refcount overflow, and there are a couple of different ways to
>>     fix it.
>>
>> b) if it still reproduces with the above, then it's some other random mistake,
>>     and in that case I'd be inclined to do a sort of guided (or classic, unguided)
>>     git bisect of the series. Because it could be any of several patches.
>>
>>     If that's too much trouble, then I'd have to fall back to submitting a few
>>     patches at a time and working my way up to the tracking patch...
> 
> It could also be that an ordinary page reference is dropped with 'unpin'
> thus underflowing the page refcount...
> 
> 								Honza
> 

Yes.
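
To spell that possibility out with toy numbers (this is not kernel code):

/*
 * Toy illustration of the mismatch Jan describes: a page that was taken
 * with a plain get_user_pages() reference, but released with
 * unpin_user_page(), loses a full GUP_PIN_COUNTING_BIAS instead of 1.
 */
#include <stdio.h>

#define GUP_PIN_COUNTING_BIAS (1 << 10)

int main(void)
{
	int refcount = 3;			/* some pre-existing references */

	refcount += 1;				/* get_user_pages(): plain ref  */
	refcount -= GUP_PIN_COUNTING_BIAS;	/* wrongly paired with an unpin */

	printf("refcount = %d\n", refcount);	/* strongly negative: -1020     */
	return 0;
}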

And I think I'm about out of time for this release cycle, so I'm probably going to
submit the prerequisite patches (patches 1-10, or more boldly, 1-22) as candidates
for 5.6.


thanks,