Message ID: 20191216222537.491123-1-jhubbard@nvidia.com
Series: mm/gup: track dma-pinned pages: FOLL_PIN
Hi! On Mon 16-12-19 14:25:12, John Hubbard wrote: > Hi, > > This implements an API naming change (put_user_page*() --> > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It > extends that tracking to a few select subsystems. More subsystems will > be added in follow up work. Just a note for Andrew and others watching this series: At this point I'm fine with the series so if someone still has some review feedback or wants to check the series, now is the right time. Otherwise I think Andrew can push the series to MM tree so that it will get wider testing exposure and is prepared for the next merge window. Honza
On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote: > Hi, > > This implements an API naming change (put_user_page*() --> > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It > extends that tracking to a few select subsystems. More subsystems will > be added in follow up work. Hi John, The patchset generates kernel panics in our IB testing. In our tests, we allocated single memory block and registered multiple MRs using the single block. The possible bad flow is: ib_umem_geti() -> pin_user_pages_fast(FOLL_WRITE) -> internal_get_user_pages_fast(FOLL_WRITE) -> gup_pgd_range() -> gup_huge_pd() -> gup_hugepte() -> try_grab_compound_head() -> 108 static __maybe_unused struct page *try_grab_compound_head(struct page *page, 109 int refs, 110 unsigned int flags) 111 { 112 if (flags & FOLL_GET) 113 return try_get_compound_head(page, refs); 114 else if (flags & FOLL_PIN) 115 return try_pin_compound_head(page, refs); 116 117 WARN_ON_ONCE(1); 118 return NULL; 119 } # (master) $ dmesg [10924.722220] mlx5_core 0000:00:08.0 eth2: Link up [10924.725383] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready [10960.902254] ------------[ cut here ]------------ [10960.905614] WARNING: CPU: 3 PID: 8838 at mm/gup.c:61 try_grab_compound_head+0x92/0xd0 [10960.907313] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp rpcrdma rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core kvm_intel mlx5_core rfkill mlxfw sunrpc virtio_net pci_hyperv_intf kvm irqbypass net_failover crc32_pclmul i2c_piix4 ptp crc32c_intel failover pcspkr ghash_clmulni_intel i2c_core pps_core sch_fq_codel ip_tables ata_generic pata_acpi serio_raw ata_piix floppy [last unloaded: mlxkvl] [10960.917806] CPU: 3 PID: 8838 Comm: consume_mtts Tainted: G OE 5.5.0-rc2-for-upstream-perf-2019-12-18_10-06-50-78 #1 [10960.920530] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [10960.923024] RIP: 0010:try_grab_compound_head+0x92/0xd0 [10960.924329] Code: e4 8d 14 06 48 8d 4f 34 f0 0f b1 57 34 0f 94 c2 84 d2 75 cb 85 c0 74 cd 8d 14 06 f0 0f b1 11 0f 94 c2 84 d2 75 b9 66 90 eb ea <0f> 0b 31 ff eb b7 85 c0 66 0f 1f 44 00 00 74 ab 8d 14 06 f0 0f b1 [10960.928512] RSP: 0018:ffffc9000129f880 EFLAGS: 00010082 [10960.929831] RAX: 0000000080000001 RBX: 00007f6397446000 RCX: 000fffffffe00000 [10960.931422] RDX: 0000000000040000 RSI: 0000000000011800 RDI: ffffea000f5d8000 [10960.933005] RBP: ffffc9000129f93c R08: ffffc9000129f93c R09: 0000000000200000 [10960.934584] R10: ffff88840774b200 R11: ffff888000000230 R12: 00007f6397446000 [10960.936212] R13: 0000000000000046 R14: 80000003d76000e7 R15: 0000000000000080 [10960.937793] FS: 00007f63a0590740(0000) GS:ffff88842f980000(0000) knlGS:0000000000000000 [10960.939962] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [10960.941367] CR2: 00000000023e9008 CR3: 0000000406d0a002 CR4: 00000000007606e0 [10960.942975] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [10960.944654] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [10960.946394] PKRU: 55555554 [10960.947310] Call Trace: [10960.948193] gup_pgd_range+0x61e/0x950 [10960.949585] internal_get_user_pages_fast+0x98/0x1c0 [10960.951313] ib_umem_get+0x2b3/0x5a0 [ib_uverbs] [10960.952929] mr_umem_get+0xd8/0x280 [mlx5_ib] [10960.954150] ? xas_store+0x49/0x550 [10960.955187] mlx5_ib_reg_user_mr+0x149/0x7a0 [mlx5_ib] [10960.956478] ? 
xas_load+0x9/0x80 [10960.957474] ? xa_load+0x54/0x90 [10960.958465] ? lookup_get_idr_uobject.part.10+0x12/0x80 [ib_uverbs] [10960.959926] ib_uverbs_reg_mr+0x138/0x2a0 [ib_uverbs] [10960.961192] ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xb1/0xf0 [ib_uverbs] [10960.963208] ib_uverbs_cmd_verbs.isra.8+0x997/0xb30 [ib_uverbs] [10960.964603] ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs] [10960.965949] ? mem_cgroup_commit_charge+0x6a/0x140 [10960.967177] ? page_add_new_anon_rmap+0x58/0xc0 [10960.968360] ib_uverbs_ioctl+0xbc/0x130 [ib_uverbs] [10960.969595] do_vfs_ioctl+0xa6/0x640 [10960.970631] ? syscall_trace_enter+0x1f8/0x2e0 [10960.971829] ksys_ioctl+0x60/0x90 [10960.972825] __x64_sys_ioctl+0x16/0x20 [10960.973888] do_syscall_64+0x48/0x130 [10960.974949] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [10960.976219] RIP: 0033:0x7f639fe9b267 [10960.977260] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48 [10960.981413] RSP: 002b:00007fff5335ca08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [10960.983472] RAX: ffffffffffffffda RBX: 00007fff5335ca98 RCX: 00007f639fe9b267 [10960.985037] RDX: 00007fff5335ca80 RSI: 00000000c0181b01 RDI: 0000000000000003 [10960.986603] RBP: 00007fff5335ca60 R08: 0000000000000003 R09: 00007f63a055e010 [10960.988194] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f63a055e150 [10960.989903] R13: 00007fff5335ca60 R14: 00007fff5335cc38 R15: 00007f6397246000 [10960.991544] ---[ end trace 1f0ee07a75a16a93 ]--- [10960.992773] ------------[ cut here ]------------ [10960.993995] WARNING: CPU: 3 PID: 8838 at mm/gup.c:150 try_grab_page+0x55/0x70 [10960.995758] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp rpcrdma rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core kvm_intel mlx5_core rfkill mlxfw sunrpc virtio_net pci_hyperv_intf kvm irqbypass net_failover crc32_pclmul i2c_piix4 ptp crc32c_intel failover pcspkr ghash_clmulni_intel i2c_core pps_core sch_fq_codel ip_tables ata_generic pata_acpi serio_raw ata_piix floppy [last unloaded: mlxkvl] [10961.008579] CPU: 3 PID: 8838 Comm: consume_mtts Tainted: G W OE 5.5.0-rc2-for-upstream-perf-2019-12-18_10-06-50-78 #1 [10961.011416] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [10961.013766] RIP: 0010:try_grab_page+0x55/0x70 [10961.014921] Code: 00 04 00 00 b8 01 00 00 00 f3 c3 48 8b 47 08 a8 01 75 1c 8b 47 34 85 c0 7e 1d f0 ff 47 34 b8 01 00 00 00 c3 48 8d 78 ff eb cb <0f> 0b 31 c0 c3 48 8d 78 ff 66 90 eb dc 0f 0b 31 c0 c3 66 0f 1f 84 [10961.019058] RSP: 0018:ffffc9000129f7e8 EFLAGS: 00010282 [10961.020351] RAX: 0000000080000001 RBX: 0000000000050201 RCX: 000000000f5d8000 [10961.021921] RDX: 000ffffffffff000 RSI: 0000000000040000 RDI: ffffea000f5d8000 [10961.023494] RBP: 00007f6397400000 R08: ffffea000f986cc0 R09: ffff8883c758bdd0 [10961.025067] R10: 0000000000000001 R11: ffff888000000230 R12: ffff888407701c00 [10961.026637] R13: ffff8883e61b35d0 R14: ffffea000f5d8000 R15: 0000000000050201 [10961.028217] FS: 00007f63a0590740(0000) GS:ffff88842f980000(0000) knlGS:0000000000000000 [10961.030353] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [10961.031721] CR2: 00000000023e9008 CR3: 0000000406d0a002 CR4: 00000000007606e0 [10961.033305] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 
[10961.034884] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [10961.036456] PKRU: 55555554 [10961.037369] Call Trace: [10961.038285] follow_trans_huge_pmd+0x10c/0x300 [10961.039555] follow_page_mask+0x64a/0x760 [10961.040762] __get_user_pages+0xf1/0x720 [10961.041851] ? apic_timer_interrupt+0xa/0x20 [10961.042996] internal_get_user_pages_fast+0x14b/0x1c0 [10961.044266] ib_umem_get+0x2b3/0x5a0 [ib_uverbs] [10961.045474] mr_umem_get+0xd8/0x280 [mlx5_ib] [10961.046652] ? xas_store+0x49/0x550 [10961.047696] mlx5_ib_reg_user_mr+0x149/0x7a0 [mlx5_ib] [10961.048967] ? xas_load+0x9/0x80 [10961.049949] ? xa_load+0x54/0x90 [10961.050935] ? lookup_get_idr_uobject.part.10+0x12/0x80 [ib_uverbs] [10961.052378] ib_uverbs_reg_mr+0x138/0x2a0 [ib_uverbs] [10961.053635] ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xb1/0xf0 [ib_uverbs] [10961.055646] ib_uverbs_cmd_verbs.isra.8+0x997/0xb30 [ib_uverbs] [10961.057033] ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs] [10961.058381] ? mem_cgroup_commit_charge+0x6a/0x140 [10961.059611] ? page_add_new_anon_rmap+0x58/0xc0 [10961.060796] ib_uverbs_ioctl+0xbc/0x130 [ib_uverbs] [10961.062034] do_vfs_ioctl+0xa6/0x640 [10961.063081] ? syscall_trace_enter+0x1f8/0x2e0 [10961.064253] ksys_ioctl+0x60/0x90 [10961.065252] __x64_sys_ioctl+0x16/0x20 [10961.066315] do_syscall_64+0x48/0x130 [10961.067382] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [10961.068647] RIP: 0033:0x7f639fe9b267 [10961.069691] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48 [10961.073882] RSP: 002b:00007fff5335ca08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [10961.075949] RAX: ffffffffffffffda RBX: 00007fff5335ca98 RCX: 00007f639fe9b267 [10961.077545] RDX: 00007fff5335ca80 RSI: 00000000c0181b01 RDI: 0000000000000003 [10961.079128] RBP: 00007fff5335ca60 R08: 0000000000000003 R09: 00007f63a055e010 [10961.080709] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f63a055e150 [10961.082278] R13: 00007fff5335ca60 R14: 00007fff5335cc38 R15: 00007f6397246000 [10961.083873] ---[ end trace 1f0ee07a75a16a94 ]--- Thanks
On 12/19/19 5:26 AM, Leon Romanovsky wrote: > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote: >> Hi, >> >> This implements an API naming change (put_user_page*() --> >> unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It >> extends that tracking to a few select subsystems. More subsystems will >> be added in follow up work. > > Hi John, > > The patchset generates kernel panics in our IB testing. In our tests, we > allocated single memory block and registered multiple MRs using the single > block. > > The possible bad flow is: > ib_umem_geti() -> > pin_user_pages_fast(FOLL_WRITE) -> > internal_get_user_pages_fast(FOLL_WRITE) -> > gup_pgd_range() -> > gup_huge_pd() -> > gup_hugepte() -> > try_grab_compound_head() -> Hi Leon, Thanks very much for the detailed report! So we're overflowing... At first look, this seems likely to be hitting a weak point in the GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred (there's a writeup in Documentation/core-api/pin_user_page.rst, lines 99-121). Basically it's pretty easy to overflow the page->_refcount with huge pages if the pages have a *lot* of subpages. We can only do about 7 pins on 1GB huge pages that use 4KB subpages. Do you have any idea how many pins (repeated pins on the same page, which it sounds like you have) might be involved in your test case, and the huge page and system page sizes? That would allow calculating if we're likely overflowing for that reason. So, ideas and next steps: 1. Assuming that you *are* hitting this, I think I may have to fall back to implementing the "deferred" part of this design, as part of this series, after all. That means: For the pin/unpin calls at least, stop treating all pages as if they are a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one page. That's not how it works now, and the need to hand back a huge array of subpages is part of the problem. This affects the callers too, so it's not a super quick change to make. (I was really hoping not to have to do this yet.) 2. OR, maybe if you're *close* to the overflow, I could buy some development time by moving the boundary between pinned vs get_page() refcounts, by reducing GUP_PIN_COUNTING_BIAS. That's less robust, but I don't want to rule it out just yet. After all, 1024 is a big chunk to count up with, and even if get_page() calls add up to, say, 512 refs on a page, it's still just a false positive on page_dma_pinned(). And false positives, if transient, are OK. 3. It would be nice if I could reproduce this. I have a two-node mlx5 Infiniband test setup, but I have done only the tiniest bit of user space IB coding, so if you have any test programs that aren't too hard to deal with that could possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind that I'm not an advanced IB programmer. At all. :) 4. (minor note to self) This also uncovers a minor weakness in diagnostics: there's no page dump in these reports, because I chickened out and didn't include my WARN_ONCE_PAGE() macro that would have provided it. Although, even without it, it's obvious that this is a page overflow. thanks,
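As a sanity check on the numbers above, here is a minimal user-space sketch of the overflow arithmetic, assuming 4 KB base pages, a 32-bit signed page->_refcount, and the series' GUP_PIN_COUNTING_BIAS of 1024; the helper is illustrative only, not kernel code. Each FOLL_PIN of a compound page adds the bias once per covered subpage to the head page's refcount, which is why a 1 GB huge page runs out of headroom after roughly 7 pins while a 2 MB huge page allows a few thousand:

#include <stdio.h>

/* Illustrative model: each pin of a compound page adds
 * GUP_PIN_COUNTING_BIAS per subpage to the head page's 32-bit _refcount.
 */
#define GUP_PIN_COUNTING_BIAS	(1L << 10)	/* 1024, as in the posted series */
#define REFCOUNT_MAX		0x7fffffffL	/* 32-bit signed overflow point */

static long max_pins(long huge_page_size, long base_page_size)
{
	long subpages = huge_page_size / base_page_size;

	return REFCOUNT_MAX / (subpages * GUP_PIN_COUNTING_BIAS);
}

int main(void)
{
	/* 1 GB / 4 KB = 262144 subpages, so one pin adds 2^28 to the refcount */
	printf("1GB huge page, 4KB base: ~%ld pins\n", max_pins(1L << 30, 1L << 12));	/* ~7 */
	/* 2 MB / 4 KB = 512 subpages, so one pin adds 2^19 */
	printf("2MB huge page, 4KB base: ~%ld pins\n", max_pins(1L << 21, 1L << 12));	/* ~4095 */
	return 0;
}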
On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote: > On 12/19/19 5:26 AM, Leon Romanovsky wrote: > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote: > > > Hi, > > > > > > This implements an API naming change (put_user_page*() --> > > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It > > > extends that tracking to a few select subsystems. More subsystems will > > > be added in follow up work. > > > > Hi John, > > > > The patchset generates kernel panics in our IB testing. In our tests, we > > allocated single memory block and registered multiple MRs using the single > > block. > > > > The possible bad flow is: > > ib_umem_geti() -> > > pin_user_pages_fast(FOLL_WRITE) -> > > internal_get_user_pages_fast(FOLL_WRITE) -> > > gup_pgd_range() -> > > gup_huge_pd() -> > > gup_hugepte() -> > > try_grab_compound_head() -> > > Hi Leon, > > Thanks very much for the detailed report! So we're overflowing... > > At first look, this seems likely to be hitting a weak point in the > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred > (there's a writeup in Documentation/core-api/pin_user_page.rst, lines > 99-121). Basically it's pretty easy to overflow the page->_refcount > with huge pages if the pages have a *lot* of subpages. > > We can only do about 7 pins on 1GB huge pages that use 4KB subpages. Considering that establishing these pins is entirely under user control, we can't have a limit here. If the number of allowed pins are exhausted then the pin_user_pages_fast() must fail back to the user. > 3. It would be nice if I could reproduce this. I have a two-node mlx5 Infiniband > test setup, but I have done only the tiniest bit of user space IB coding, so > if you have any test programs that aren't too hard to deal with that could > possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind > that I'm not an advanced IB programmer. At all. :) Clone this: https://github.com/linux-rdma/rdma-core.git Install all the required deps to build it (notably cython), see the README.md $ ./build.sh $ build/bin/run_tests.py If you get things that far I think Leon can get a reproduction for you Jason
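In case it helps anyone reproducing this: run_tests.py is a unittest-style runner, so individual modules or cases can be selected by name once the build succeeds (the module names below are the ones that appear in the test output later in this thread):

$ ./build.sh
$ build/bin/run_tests.py --verbose tests.test_mr
$ build/bin/run_tests.py --verbose tests.test_device.DeviceTest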
On 12/19/19 1:07 PM, Jason Gunthorpe wrote: > On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote: >> On 12/19/19 5:26 AM, Leon Romanovsky wrote: >>> On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote: >>>> Hi, >>>> >>>> This implements an API naming change (put_user_page*() --> >>>> unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It >>>> extends that tracking to a few select subsystems. More subsystems will >>>> be added in follow up work. >>> >>> Hi John, >>> >>> The patchset generates kernel panics in our IB testing. In our tests, we >>> allocated single memory block and registered multiple MRs using the single >>> block. >>> >>> The possible bad flow is: >>> ib_umem_geti() -> >>> pin_user_pages_fast(FOLL_WRITE) -> >>> internal_get_user_pages_fast(FOLL_WRITE) -> >>> gup_pgd_range() -> >>> gup_huge_pd() -> >>> gup_hugepte() -> >>> try_grab_compound_head() -> >> >> Hi Leon, >> >> Thanks very much for the detailed report! So we're overflowing... >> >> At first look, this seems likely to be hitting a weak point in the >> GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred >> (there's a writeup in Documentation/core-api/pin_user_page.rst, lines >> 99-121). Basically it's pretty easy to overflow the page->_refcount >> with huge pages if the pages have a *lot* of subpages. >> >> We can only do about 7 pins on 1GB huge pages that use 4KB subpages. > > Considering that establishing these pins is entirely under user > control, we can't have a limit here. There's already a limit, it's just a much larger one. :) What does "no limit" really mean, numerically, to you in this case? > > If the number of allowed pins are exhausted then the > pin_user_pages_fast() must fail back to the user. I'll poke around the IB call stack and see how much of that return path is in place, if any. Because it's the same situation for get_user_pages_fast(). This code just added a warning on overflow so we could spot it early. > >> 3. It would be nice if I could reproduce this. I have a two-node mlx5 Infiniband >> test setup, but I have done only the tiniest bit of user space IB coding, so >> if you have any test programs that aren't too hard to deal with that could >> possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind >> that I'm not an advanced IB programmer. At all. :) > > Clone this: > > https://github.com/linux-rdma/rdma-core.git > > Install all the required deps to build it (notably cython), see the README.md > > $ ./build.sh > $ build/bin/run_tests.py > > If you get things that far I think Leon can get a reproduction for you > OK, here goes. thanks,
On 12/19/19 1:07 PM, Jason Gunthorpe wrote: ... >> 3. It would be nice if I could reproduce this. I have a two-node mlx5 Infiniband >> test setup, but I have done only the tiniest bit of user space IB coding, so >> if you have any test programs that aren't too hard to deal with that could >> possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind >> that I'm not an advanced IB programmer. At all. :) > > Clone this: > > https://github.com/linux-rdma/rdma-core.git > > Install all the required deps to build it (notably cython), see the README.md > > $ ./build.sh > $ build/bin/run_tests.py > > If you get things that far I think Leon can get a reproduction for you > Cool, it's up and running (1 failure, 3 skipped, out of 67 tests). This is a great test suite to have running, I'll add it to my scripts. Here's the full output in case the failure or skip cases are a problem: $ sudo ./build/bin/run_tests.py --verbose test_create_ah (tests.test_addr.AHTest) ... ok test_create_ah_roce (tests.test_addr.AHTest) ... skipped "Can't run RoCE tests on IB link layer" test_destroy_ah (tests.test_addr.AHTest) ... ok test_create_comp_channel (tests.test_cq.CCTest) ... ok test_destroy_comp_channel (tests.test_cq.CCTest) ... ok test_create_cq_ex (tests.test_cq.CQEXTest) ... ok test_create_cq_ex_bad_flow (tests.test_cq.CQEXTest) ... ok test_destroy_cq_ex (tests.test_cq.CQEXTest) ... ok test_create_cq (tests.test_cq.CQTest) ... ok test_create_cq_bad_flow (tests.test_cq.CQTest) ... ok test_destroy_cq (tests.test_cq.CQTest) ... ok test_rc_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok test_ud_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok test_xrc_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok test_create_dm (tests.test_device.DMTest) ... ok test_create_dm_bad_flow (tests.test_device.DMTest) ... ok test_destroy_dm (tests.test_device.DMTest) ... ok test_destroy_dm_bad_flow (tests.test_device.DMTest) ... ok test_dm_read (tests.test_device.DMTest) ... ok test_dm_write (tests.test_device.DMTest) ... ok test_dm_write_bad_flow (tests.test_device.DMTest) ... ok test_dev_list (tests.test_device.DeviceTest) ... ok test_open_dev (tests.test_device.DeviceTest) ... ok test_query_device (tests.test_device.DeviceTest) ... ok test_query_device_ex (tests.test_device.DeviceTest) ... ok test_query_gid (tests.test_device.DeviceTest) ... ok test_query_port (tests.test_device.DeviceTest) ... FAIL test_query_port_bad_flow (tests.test_device.DeviceTest) ... ok test_create_dm_mr (tests.test_mr.DMMRTest) ... ok test_destroy_dm_mr (tests.test_mr.DMMRTest) ... ok test_buffer (tests.test_mr.MRTest) ... ok test_dereg_mr (tests.test_mr.MRTest) ... ok test_dereg_mr_twice (tests.test_mr.MRTest) ... ok test_lkey (tests.test_mr.MRTest) ... ok test_read (tests.test_mr.MRTest) ... ok test_reg_mr (tests.test_mr.MRTest) ... ok test_reg_mr_bad_flags (tests.test_mr.MRTest) ... ok test_reg_mr_bad_flow (tests.test_mr.MRTest) ... ok test_rkey (tests.test_mr.MRTest) ... ok test_write (tests.test_mr.MRTest) ... ok test_dereg_mw_type1 (tests.test_mr.MWTest) ... ok test_dereg_mw_type2 (tests.test_mr.MWTest) ... ok test_reg_mw_type1 (tests.test_mr.MWTest) ... ok test_reg_mw_type2 (tests.test_mr.MWTest) ... ok test_reg_mw_wrong_type (tests.test_mr.MWTest) ... ok test_odp_rc_traffic (tests.test_odp.OdpTestCase) ... ok test_odp_ud_traffic (tests.test_odp.OdpTestCase) ... skipped 'ODP is not supported - ODP recv not supported' test_odp_xrc_traffic (tests.test_odp.OdpTestCase) ... 
ok test_default_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok test_mem_align_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok test_without_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok test_alloc_pd (tests.test_pd.PDTest) ... ok test_create_pd_none_ctx (tests.test_pd.PDTest) ... ok test_dealloc_pd (tests.test_pd.PDTest) ... ok test_destroy_pd_twice (tests.test_pd.PDTest) ... ok test_multiple_pd_creation (tests.test_pd.PDTest) ... ok test_create_qp_ex_no_attr (tests.test_qp.QPTest) ... ok test_create_qp_ex_no_attr_connected (tests.test_qp.QPTest) ... ok test_create_qp_ex_with_attr (tests.test_qp.QPTest) ... ok test_create_qp_ex_with_attr_connected (tests.test_qp.QPTest) ... ok test_create_qp_no_attr (tests.test_qp.QPTest) ... ok test_create_qp_no_attr_connected (tests.test_qp.QPTest) ... ok test_create_qp_with_attr (tests.test_qp.QPTest) ... ok test_create_qp_with_attr_connected (tests.test_qp.QPTest) ... ok test_modify_qp (tests.test_qp.QPTest) ... ok test_query_qp (tests.test_qp.QPTest) ... ok test_rdmacm_sync_traffic (tests.test_rdmacm.CMTestCase) ... skipped 'No devices with net interface' ====================================================================== FAIL: test_query_port (tests.test_device.DeviceTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "/kernel_work/rdma-core/tests/test_device.py", line 129, in test_query_port self.verify_port_attr(port_attr) File "/kernel_work/rdma-core/tests/test_device.py", line 113, in verify_port_attr assert 'Invalid' not in d.speed_to_str(attr.active_speed) AssertionError ---------------------------------------------------------------------- Ran 67 tests in 10.058s FAILED (failures=1, skipped=3) thanks,
On Thu 19-12-19 12:30:31, John Hubbard wrote: > On 12/19/19 5:26 AM, Leon Romanovsky wrote: > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote: > > > Hi, > > > > > > This implements an API naming change (put_user_page*() --> > > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It > > > extends that tracking to a few select subsystems. More subsystems will > > > be added in follow up work. > > > > Hi John, > > > > The patchset generates kernel panics in our IB testing. In our tests, we > > allocated single memory block and registered multiple MRs using the single > > block. > > > > The possible bad flow is: > > ib_umem_geti() -> > > pin_user_pages_fast(FOLL_WRITE) -> > > internal_get_user_pages_fast(FOLL_WRITE) -> > > gup_pgd_range() -> > > gup_huge_pd() -> > > gup_hugepte() -> > > try_grab_compound_head() -> > > Hi Leon, > > Thanks very much for the detailed report! So we're overflowing... > > At first look, this seems likely to be hitting a weak point in the > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred > (there's a writeup in Documentation/core-api/pin_user_page.rst, lines > 99-121). Basically it's pretty easy to overflow the page->_refcount > with huge pages if the pages have a *lot* of subpages. > > We can only do about 7 pins on 1GB huge pages that use 4KB subpages. > Do you have any idea how many pins (repeated pins on the same page, which > it sounds like you have) might be involved in your test case, > and the huge page and system page sizes? That would allow calculating > if we're likely overflowing for that reason. > > So, ideas and next steps: > > 1. Assuming that you *are* hitting this, I think I may have to fall back to > implementing the "deferred" part of this design, as part of this series, after > all. That means: > > For the pin/unpin calls at least, stop treating all pages as if they are > a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one page. > That's not how it works now, and the need to hand back a huge array of > subpages is part of the problem. This affects the callers too, so it's not > a super quick change to make. (I was really hoping not to have to do this > yet.) Does that mean that you would need to make all GUP users huge page aware? Otherwise I don't see how what you suggest would work... And I don't think making all GUP users huge page aware is realistic (effort-wise) or even wanted (maintenance overhead in all those places). I believe there might be also a different solution for this: For transparent huge pages, we could find a space in 'struct page' of the second page in the huge page for proper pin counter and just account pins there so we'd have full width of 32-bits for it. Honza
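A toy user-space model of what Jan is suggesting above, purely to show the shape of the idea; the struct and helper names are invented for this sketch and are deliberately not the kernel's struct page. The point is a dedicated, full-width pin counter conceptually stored in the second page of the huge page, so pins stop multiplying a large bias into the head page's 32-bit _refcount:

/* Toy model of a dedicated compound-page pin counter (not kernel code). */
struct toy_compound_page {
	int refcount;	/* models the head page's _refcount */
	int pin_count;	/* models a counter stored in the first tail page */
};

static void toy_pin(struct toy_compound_page *p)
{
	p->pin_count++;	/* full 32-bit range available for pins */
	p->refcount++;	/* one ordinary reference per pin, no large bias */
}

static void toy_unpin(struct toy_compound_page *p)
{
	p->pin_count--;
	p->refcount--;
}

static int toy_page_dma_pinned(const struct toy_compound_page *p)
{
	return p->pin_count > 0;
}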
On Thu, Dec 19, 2019 at 01:13:54PM -0800, John Hubbard wrote: > On 12/19/19 1:07 PM, Jason Gunthorpe wrote: > > On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote: > > > On 12/19/19 5:26 AM, Leon Romanovsky wrote: > > > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote: > > > > > Hi, > > > > > > > > > > This implements an API naming change (put_user_page*() --> > > > > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It > > > > > extends that tracking to a few select subsystems. More subsystems will > > > > > be added in follow up work. > > > > > > > > Hi John, > > > > > > > > The patchset generates kernel panics in our IB testing. In our tests, we > > > > allocated single memory block and registered multiple MRs using the single > > > > block. > > > > > > > > The possible bad flow is: > > > > ib_umem_geti() -> > > > > pin_user_pages_fast(FOLL_WRITE) -> > > > > internal_get_user_pages_fast(FOLL_WRITE) -> > > > > gup_pgd_range() -> > > > > gup_huge_pd() -> > > > > gup_hugepte() -> > > > > try_grab_compound_head() -> > > > > > > Hi Leon, > > > > > > Thanks very much for the detailed report! So we're overflowing... > > > > > > At first look, this seems likely to be hitting a weak point in the > > > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred > > > (there's a writeup in Documentation/core-api/pin_user_page.rst, lines > > > 99-121). Basically it's pretty easy to overflow the page->_refcount > > > with huge pages if the pages have a *lot* of subpages. > > > > > > We can only do about 7 pins on 1GB huge pages that use 4KB subpages. > > > > Considering that establishing these pins is entirely under user > > control, we can't have a limit here. > > There's already a limit, it's just a much larger one. :) What does "no limit" > really mean, numerically, to you in this case? I guess I mean 'hidden limit' - hitting the limit and failing would be managable. I think 7 is probably too low though, but we are not using 1GB huge pages, only 2M.. > > If the number of allowed pins are exhausted then the > > pin_user_pages_fast() must fail back to the user. > > I'll poke around the IB call stack and see how much of that return > path is in place, if any. Because it's the same situation for > get_user_pages_fast(). This code just added a warning on overflow > so we could spot it early. All GUP callers must be prepared for failure, IB should be fine... Jason
On Thu, Dec 19, 2019 at 05:07:43PM -0400, Jason Gunthorpe wrote: > On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote: > > On 12/19/19 5:26 AM, Leon Romanovsky wrote: > > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote: > > > > Hi, > > > > > > > > This implements an API naming change (put_user_page*() --> > > > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It > > > > extends that tracking to a few select subsystems. More subsystems will > > > > be added in follow up work. > > > > > > Hi John, > > > > > > The patchset generates kernel panics in our IB testing. In our tests, we > > > allocated single memory block and registered multiple MRs using the single > > > block. > > > > > > The possible bad flow is: > > > ib_umem_geti() -> > > > pin_user_pages_fast(FOLL_WRITE) -> > > > internal_get_user_pages_fast(FOLL_WRITE) -> > > > gup_pgd_range() -> > > > gup_huge_pd() -> > > > gup_hugepte() -> > > > try_grab_compound_head() -> > > > > Hi Leon, > > > > Thanks very much for the detailed report! So we're overflowing... > > > > At first look, this seems likely to be hitting a weak point in the > > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred > > (there's a writeup in Documentation/core-api/pin_user_page.rst, lines > > 99-121). Basically it's pretty easy to overflow the page->_refcount > > with huge pages if the pages have a *lot* of subpages. > > > > We can only do about 7 pins on 1GB huge pages that use 4KB subpages. > > Considering that establishing these pins is entirely under user > control, we can't have a limit here. > > If the number of allowed pins are exhausted then the > pin_user_pages_fast() must fail back to the user. > > > 3. It would be nice if I could reproduce this. I have a two-node mlx5 Infiniband > > test setup, but I have done only the tiniest bit of user space IB coding, so > > if you have any test programs that aren't too hard to deal with that could > > possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind > > that I'm not an advanced IB programmer. At all. :) > > Clone this: > > https://github.com/linux-rdma/rdma-core.git > > Install all the required deps to build it (notably cython), see the README.md > > $ ./build.sh > $ build/bin/run_tests.py > > If you get things that far I think Leon can get a reproduction for you I'm not so optimistic about that. Thanks > > Jason
On Thu, Dec 19, 2019 at 02:58:43PM -0800, John Hubbard wrote: > On 12/19/19 1:07 PM, Jason Gunthorpe wrote: > ... > > > 3. It would be nice if I could reproduce this. I have a two-node mlx5 Infiniband > > > test setup, but I have done only the tiniest bit of user space IB coding, so > > > if you have any test programs that aren't too hard to deal with that could > > > possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind > > > that I'm not an advanced IB programmer. At all. :) > > > > Clone this: > > > > https://github.com/linux-rdma/rdma-core.git > > > > Install all the required deps to build it (notably cython), see the README.md > > > > $ ./build.sh > > $ build/bin/run_tests.py > > > > If you get things that far I think Leon can get a reproduction for you > > > > Cool, it's up and running (1 failure, 3 skipped, out of 67 tests). > > This is a great test suite to have running, I'll add it to my scripts. Here's the > full output in case the failure or skip cases are a problem: > > $ sudo ./build/bin/run_tests.py --verbose > > test_create_ah (tests.test_addr.AHTest) ... ok > test_create_ah_roce (tests.test_addr.AHTest) ... skipped "Can't run RoCE tests on IB link layer" > test_destroy_ah (tests.test_addr.AHTest) ... ok > test_create_comp_channel (tests.test_cq.CCTest) ... ok > test_destroy_comp_channel (tests.test_cq.CCTest) ... ok > test_create_cq_ex (tests.test_cq.CQEXTest) ... ok > test_create_cq_ex_bad_flow (tests.test_cq.CQEXTest) ... ok > test_destroy_cq_ex (tests.test_cq.CQEXTest) ... ok > test_create_cq (tests.test_cq.CQTest) ... ok > test_create_cq_bad_flow (tests.test_cq.CQTest) ... ok > test_destroy_cq (tests.test_cq.CQTest) ... ok > test_rc_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok > test_ud_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok > test_xrc_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok > test_create_dm (tests.test_device.DMTest) ... ok > test_create_dm_bad_flow (tests.test_device.DMTest) ... ok > test_destroy_dm (tests.test_device.DMTest) ... ok > test_destroy_dm_bad_flow (tests.test_device.DMTest) ... ok > test_dm_read (tests.test_device.DMTest) ... ok > test_dm_write (tests.test_device.DMTest) ... ok > test_dm_write_bad_flow (tests.test_device.DMTest) ... ok > test_dev_list (tests.test_device.DeviceTest) ... ok > test_open_dev (tests.test_device.DeviceTest) ... ok > test_query_device (tests.test_device.DeviceTest) ... ok > test_query_device_ex (tests.test_device.DeviceTest) ... ok > test_query_gid (tests.test_device.DeviceTest) ... ok > test_query_port (tests.test_device.DeviceTest) ... FAIL > test_query_port_bad_flow (tests.test_device.DeviceTest) ... ok > test_create_dm_mr (tests.test_mr.DMMRTest) ... ok > test_destroy_dm_mr (tests.test_mr.DMMRTest) ... ok > test_buffer (tests.test_mr.MRTest) ... ok > test_dereg_mr (tests.test_mr.MRTest) ... ok > test_dereg_mr_twice (tests.test_mr.MRTest) ... ok > test_lkey (tests.test_mr.MRTest) ... ok > test_read (tests.test_mr.MRTest) ... ok > test_reg_mr (tests.test_mr.MRTest) ... ok > test_reg_mr_bad_flags (tests.test_mr.MRTest) ... ok > test_reg_mr_bad_flow (tests.test_mr.MRTest) ... ok > test_rkey (tests.test_mr.MRTest) ... ok > test_write (tests.test_mr.MRTest) ... ok > test_dereg_mw_type1 (tests.test_mr.MWTest) ... ok > test_dereg_mw_type2 (tests.test_mr.MWTest) ... ok > test_reg_mw_type1 (tests.test_mr.MWTest) ... ok > test_reg_mw_type2 (tests.test_mr.MWTest) ... ok > test_reg_mw_wrong_type (tests.test_mr.MWTest) ... ok > test_odp_rc_traffic (tests.test_odp.OdpTestCase) ... 
ok > test_odp_ud_traffic (tests.test_odp.OdpTestCase) ... skipped 'ODP is not supported - ODP recv not supported' > test_odp_xrc_traffic (tests.test_odp.OdpTestCase) ... ok > test_default_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok > test_mem_align_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok > test_without_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok > test_alloc_pd (tests.test_pd.PDTest) ... ok > test_create_pd_none_ctx (tests.test_pd.PDTest) ... ok > test_dealloc_pd (tests.test_pd.PDTest) ... ok > test_destroy_pd_twice (tests.test_pd.PDTest) ... ok > test_multiple_pd_creation (tests.test_pd.PDTest) ... ok > test_create_qp_ex_no_attr (tests.test_qp.QPTest) ... ok > test_create_qp_ex_no_attr_connected (tests.test_qp.QPTest) ... ok > test_create_qp_ex_with_attr (tests.test_qp.QPTest) ... ok > test_create_qp_ex_with_attr_connected (tests.test_qp.QPTest) ... ok > test_create_qp_no_attr (tests.test_qp.QPTest) ... ok > test_create_qp_no_attr_connected (tests.test_qp.QPTest) ... ok > test_create_qp_with_attr (tests.test_qp.QPTest) ... ok > test_create_qp_with_attr_connected (tests.test_qp.QPTest) ... ok > test_modify_qp (tests.test_qp.QPTest) ... ok > test_query_qp (tests.test_qp.QPTest) ... ok > test_rdmacm_sync_traffic (tests.test_rdmacm.CMTestCase) ... skipped 'No devices with net interface' > > ====================================================================== > FAIL: test_query_port (tests.test_device.DeviceTest) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "/kernel_work/rdma-core/tests/test_device.py", line 129, in test_query_port > self.verify_port_attr(port_attr) > File "/kernel_work/rdma-core/tests/test_device.py", line 113, in verify_port_attr > assert 'Invalid' not in d.speed_to_str(attr.active_speed) > AssertionError I'm very curious how did you get this assert "d.speed_to_str" covers all known speeds according to the IBTA. Thanks > > ---------------------------------------------------------------------- > Ran 67 tests in 10.058s > > FAILED (failures=1, skipped=3) > > > thanks, > -- > John Hubbard > NVIDIA
On 12/20/19 10:48 AM, Leon Romanovsky wrote: ... >> test_query_qp (tests.test_qp.QPTest) ... ok >> test_rdmacm_sync_traffic (tests.test_rdmacm.CMTestCase) ... skipped 'No devices with net interface' >> >> ====================================================================== >> FAIL: test_query_port (tests.test_device.DeviceTest) >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "/kernel_work/rdma-core/tests/test_device.py", line 129, in test_query_port >> self.verify_port_attr(port_attr) >> File "/kernel_work/rdma-core/tests/test_device.py", line 113, in verify_port_attr >> assert 'Invalid' not in d.speed_to_str(attr.active_speed) >> AssertionError > > I'm very curious how did you get this assert "d.speed_to_str" covers all > known speeds according to the IBTA. > Hi Leon, Short answer: I can make that one pass, with a small fix the the rdma-core test suite: commit a1b9fb0846e1b2356d7a16f4fbdd1960cf8dcbe5 (HEAD -> fix_speed_to_str) Author: John Hubbard <jhubbard@nvidia.com> Date: Fri Dec 20 15:07:47 2019 -0800 device: fix speed_to_str(), to handle disabled links For disabled links, the raw speed token is 0. However, speed_to_str() doesn't have that in the list. This leads to an assertion when running tests (test_query_port) when one link is down and other link(s) are up. Fix this by returning '(Disabled/down)' for the zero speed case. diff --git a/pyverbs/device.pyx b/pyverbs/device.pyx index 33d133fd..f8b7826b 100755 --- a/pyverbs/device.pyx +++ b/pyverbs/device.pyx @@ -923,8 +923,8 @@ def width_to_str(width): def speed_to_str(speed): - l = {1: '2.5 Gbps', 2: '5.0 Gbps', 4: '5.0 Gbps', 8: '10.0 Gbps', - 16: '14.0 Gbps', 32: '25.0 Gbps', 64: '50.0 Gbps'} + l = {0: '(Disabled/down)', 1: '2.5 Gbps', 2: '5.0 Gbps', 4: '5.0 Gbps', + 8: '10.0 Gbps', 16: '14.0 Gbps', 32: '25.0 Gbps', 64: '50.0 Gbps'} try: return '{s} ({n})'.format(s=l[speed], n=speed) except KeyError: Longer answer: ============== It looks like this test suite assumes that every link is connected! (Probably in most test systems, they are.) But in my setup, the ConnectX cards each have two slots, and I only have (and only need) one cable. So one link is up, and the other is disabled. This leads to the other problem, which is that if a link is disabled, the test suite finds a "0" token for attr.active_speed. That token is not in the approved list, and so d.speed_to_str() asserts. 
With some diagnostics added, I can see it checking each link: one passes, and the other asserts: diff --git a/tests/test_device.py b/tests/test_device.py index 524e0e89..7b33d7db 100644 --- a/tests/test_device.py +++ b/tests/test_device.py @@ -110,6 +110,12 @@ class DeviceTest(unittest.TestCase): assert 'Invalid' not in d.translate_mtu(attr.max_mtu) assert 'Invalid' not in d.translate_mtu(attr.active_mtu) assert 'Invalid' not in d.width_to_str(attr.active_width) + print("") + print('Diagnostics ===========================================') + print('phys_state: ', d.phys_state_to_str(attr.phys_state)) + print('active_width): ', d.width_to_str(attr.active_width)) + print('active_speed: ', d.speed_to_str(attr.active_speed)) + print('END of Diagnostics ====================================') assert 'Invalid' not in d.speed_to_str(attr.active_speed) assert 'Invalid' not in d.translate_link_layer(attr.link_layer) assert attr.max_msg_sz > 0x1000 assert attr.max_msg_sz > 0x1000 ...and the test run from that is: # ./build/bin/run_tests.py --verbose tests.test_device.DeviceTest test_dev_list (tests.test_device.DeviceTest) ... ok test_open_dev (tests.test_device.DeviceTest) ... ok test_query_device (tests.test_device.DeviceTest) ... ok test_query_device_ex (tests.test_device.DeviceTest) ... ok test_query_gid (tests.test_device.DeviceTest) ... ok test_query_port (tests.test_device.DeviceTest) ... Diagnostics =========================================== phys_state: Link up (5) active_width): 4X (2) active_speed: 25.0 Gbps (32) END of Diagnostics ==================================== Diagnostics =========================================== phys_state: Disabled (3) active_width): 4X (2) active_speed: Invalid speed END of Diagnostics ==================================== FAIL test_query_port_bad_flow (tests.test_device.DeviceTest) ... ok ====================================================================== FAIL: test_query_port (tests.test_device.DeviceTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "/kernel_work/rdma-core/tests/test_device.py", line 135, in test_query_port self.verify_port_attr(port_attr) File "/kernel_work/rdma-core/tests/test_device.py", line 119, in verify_port_attr assert 'Invalid' not in d.speed_to_str(attr.active_speed) AssertionError ---------------------------------------------------------------------- Ran 7 tests in 0.055s FAILED (failures=1) thanks,
On 12/20/19 10:29 AM, Leon Romanovsky wrote: ... >> $ ./build.sh >> $ build/bin/run_tests.py >> >> If you get things that far I think Leon can get a reproduction for you > > I'm not so optimistic about that. > OK, I'm going to proceed for now on the assumption that I've got an overflow problem that happens when huge pages are pinned. If I can get more information, great, otherwise it's probably enough. One thing: for your repro, if you know the huge page size, and the system page size for that case, that would really help. Also the number of pins per page, more or less, that you'd expect. Because Jason says that only 2M huge pages are used... Because the other possibility is that the refcount really is going negative, likely due to a mismatched pin/unpin somehow. If there's not an obvious repro case available, but you do have one (is it easy to repro, though?), then *if* you have the time, I could point you to a github branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this: diff --git a/include/linux/mm.h b/include/linux/mm.h index bb44c4d2ada7..8526fd03b978 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1077,7 +1077,7 @@ static inline void put_page(struct page *page) * get_user_pages and page_mkclean and other calls that race to set up page * table entries. */ -#define GUP_PIN_COUNTING_BIAS (1U << 10) +#define GUP_PIN_COUNTING_BIAS (1U << 8) void unpin_user_page(struct page *page); void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, If that fails to repro, then we would be zeroing in on the root cause. The branch is here (I just tested it and it seems healthy): git@github.com:johnhubbard/linux.git pin_user_pages_tracking_v11_with_diags thanks,
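For readers following the bias discussion: a simplified model of the check being traded off here, assuming, as described earlier in the thread, that "is this page pinned?" is inferred by comparing page->_refcount against GUP_PIN_COUNTING_BIAS. This is a sketch of the scheme, not the series' exact code:

/* Simplified model of the bias scheme (illustrative, not the kernel's code). */
#define GUP_PIN_COUNTING_BIAS	(1U << 10)

/* FOLL_PIN adds the bias; "pinned?" is inferred from the refcount value. */
static int model_page_dma_pinned(unsigned int refcount)
{
	return refcount >= GUP_PIN_COUNTING_BIAS;
}

Dropping the bias to 1U << 8, as in the diff above, gives a compound page four times as much refcount headroom before overflow, at the price of letting ordinary get_page() references reach the now-lower threshold more easily and produce the transient false positives John mentions; that is exactly the robustness trade-off being probed with this test branch.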
On 12/20/19 1:21 AM, Jan Kara wrote: ... >> So, ideas and next steps: >> >> 1. Assuming that you *are* hitting this, I think I may have to fall back to >> implementing the "deferred" part of this design, as part of this series, after >> all. That means: >> >> For the pin/unpin calls at least, stop treating all pages as if they are >> a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one page. >> That's not how it works now, and the need to hand back a huge array of >> subpages is part of the problem. This affects the callers too, so it's not >> a super quick change to make. (I was really hoping not to have to do this >> yet.) > > Does that mean that you would need to make all GUP users huge page aware? > Otherwise I don't see how what you suggest would work... And I don't think > making all GUP users huge page aware is realistic (effort-wise) or even > wanted (maintenance overhead in all those places). > Well, pretty much yes. It's really just the pin_user_pages*() callers, but the internals, follow_page() and such, are so interconnected right now that it would probably blow up into a huge effort, as you point out. > I believe there might be also a different solution for this: For > transparent huge pages, we could find a space in 'struct page' of the > second page in the huge page for proper pin counter and just account pins > there so we'd have full width of 32-bits for it. > > Honza > OK, let me pursue that. Given that I shouldn't need to handle pages splitting, it should be not *too* bad. I am starting to think that I should just post the first 9 or so prerequisite patches (first 9 patches, plus the v4l2 fix that arguably should have been earlier in the sequence I guess), as 5.6 candidates, while I go back to the drawing board here. thanks,
On Fri, Dec 20, 2019 at 5:34 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Thu, Dec 19, 2019 at 01:13:54PM -0800, John Hubbard wrote: > > On 12/19/19 1:07 PM, Jason Gunthorpe wrote: > > > On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote: > > > > On 12/19/19 5:26 AM, Leon Romanovsky wrote: > > > > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote: > > > > > > Hi, > > > > > > > > > > > > This implements an API naming change (put_user_page*() --> > > > > > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It > > > > > > extends that tracking to a few select subsystems. More subsystems will > > > > > > be added in follow up work. > > > > > > > > > > Hi John, > > > > > > > > > > The patchset generates kernel panics in our IB testing. In our tests, we > > > > > allocated single memory block and registered multiple MRs using the single > > > > > block. > > > > > > > > > > The possible bad flow is: > > > > > ib_umem_geti() -> > > > > > pin_user_pages_fast(FOLL_WRITE) -> > > > > > internal_get_user_pages_fast(FOLL_WRITE) -> > > > > > gup_pgd_range() -> > > > > > gup_huge_pd() -> > > > > > gup_hugepte() -> > > > > > try_grab_compound_head() -> > > > > > > > > Hi Leon, > > > > > > > > Thanks very much for the detailed report! So we're overflowing... > > > > > > > > At first look, this seems likely to be hitting a weak point in the > > > > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred > > > > (there's a writeup in Documentation/core-api/pin_user_page.rst, lines > > > > 99-121). Basically it's pretty easy to overflow the page->_refcount > > > > with huge pages if the pages have a *lot* of subpages. > > > > > > > > We can only do about 7 pins on 1GB huge pages that use 4KB subpages. > > > > > > Considering that establishing these pins is entirely under user > > > control, we can't have a limit here. > > > > There's already a limit, it's just a much larger one. :) What does "no limit" > > really mean, numerically, to you in this case? > > I guess I mean 'hidden limit' - hitting the limit and failing would > be managable. > > I think 7 is probably too low though, but we are not using 1GB huge > pages, only 2M.. What about RDMA to 1GB-hugetlbfs and 1GB-device-dax mappings?
On Fri, Dec 20, 2019 at 1:22 AM Jan Kara <jack@suse.cz> wrote: > > On Thu 19-12-19 12:30:31, John Hubbard wrote: > > On 12/19/19 5:26 AM, Leon Romanovsky wrote: > > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote: > > > > Hi, > > > > > > > > This implements an API naming change (put_user_page*() --> > > > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It > > > > extends that tracking to a few select subsystems. More subsystems will > > > > be added in follow up work. > > > > > > Hi John, > > > > > > The patchset generates kernel panics in our IB testing. In our tests, we > > > allocated single memory block and registered multiple MRs using the single > > > block. > > > > > > The possible bad flow is: > > > ib_umem_geti() -> > > > pin_user_pages_fast(FOLL_WRITE) -> > > > internal_get_user_pages_fast(FOLL_WRITE) -> > > > gup_pgd_range() -> > > > gup_huge_pd() -> > > > gup_hugepte() -> > > > try_grab_compound_head() -> > > > > Hi Leon, > > > > Thanks very much for the detailed report! So we're overflowing... > > > > At first look, this seems likely to be hitting a weak point in the > > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred > > (there's a writeup in Documentation/core-api/pin_user_page.rst, lines > > 99-121). Basically it's pretty easy to overflow the page->_refcount > > with huge pages if the pages have a *lot* of subpages. > > > > We can only do about 7 pins on 1GB huge pages that use 4KB subpages. > > Do you have any idea how many pins (repeated pins on the same page, which > > it sounds like you have) might be involved in your test case, > > and the huge page and system page sizes? That would allow calculating > > if we're likely overflowing for that reason. > > > > So, ideas and next steps: > > > > 1. Assuming that you *are* hitting this, I think I may have to fall back to > > implementing the "deferred" part of this design, as part of this series, after > > all. That means: > > > > For the pin/unpin calls at least, stop treating all pages as if they are > > a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one page. > > That's not how it works now, and the need to hand back a huge array of > > subpages is part of the problem. This affects the callers too, so it's not > > a super quick change to make. (I was really hoping not to have to do this > > yet.) > > Does that mean that you would need to make all GUP users huge page aware? > Otherwise I don't see how what you suggest would work... And I don't think > making all GUP users huge page aware is realistic (effort-wise) or even > wanted (maintenance overhead in all those places). > > I believe there might be also a different solution for this: For > transparent huge pages, we could find a space in 'struct page' of the > second page in the huge page for proper pin counter and just account pins > there so we'd have full width of 32-bits for it. That would require THP accounting for dax pages. It is something that was probably going to be needed, but this would seem to force the issue.
On 12/20/19 4:33 PM, Dan Williams wrote: ... >> I believe there might be also a different solution for this: For >> transparent huge pages, we could find a space in 'struct page' of the >> second page in the huge page for proper pin counter and just account pins >> there so we'd have full width of 32-bits for it. > > That would require THP accounting for dax pages. It is something that > was probably going to be needed, but this would seem to force the > issue. > Thanks for mentioning that, it wasn't obvious to me yet. How easy is it for mere mortals outside of Intel, to set up a DAX (nvdimm?) test setup? I'd hate to go into this without having that coverage up and running. It's been sketchy enough as it is. :) thanks,
On Fri, Dec 20, 2019 at 4:41 PM John Hubbard <jhubbard@nvidia.com> wrote: > > On 12/20/19 4:33 PM, Dan Williams wrote: > ... > >> I believe there might be also a different solution for this: For > >> transparent huge pages, we could find a space in 'struct page' of the > >> second page in the huge page for proper pin counter and just account pins > >> there so we'd have full width of 32-bits for it. > > > > That would require THP accounting for dax pages. It is something that > > was probably going to be needed, but this would seem to force the > > issue. > > > > Thanks for mentioning that, it wasn't obvious to me yet. > > How easy is it for mere mortals outside of Intel, to set up a DAX (nvdimm?) > test setup? I'd hate to go into this without having that coverage up > and running. It's been sketchy enough as it is. :) You too can have the power of the gods for the low low price of a kernel command line parameter, or a qemu setup. Details here: https://nvdimm.wiki.kernel.org/how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system https://nvdimm.wiki.kernel.org/pmem_in_qemu
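As a concrete example of the first wiki link (all sizes, offsets, and namespace names below are illustrative and must be adapted to the machine's actual memory map): a region of ordinary RAM can be reserved as emulated pmem from the kernel command line, after which it shows up as /dev/pmem0 and can be switched between fsdax and devdax modes with ndctl.

memmap=4G!12G                                               # kernel parameter: treat 4 GB of RAM starting at the 12 GB mark as pmem

$ ndctl list                                                # after reboot, the emulated region shows up as a namespace (e.g. namespace0.0)
$ ndctl create-namespace -f -e namespace0.0 --mode=devdax   # reconfigure it as device-dax for testing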
On 12/20/19 4:51 PM, Dan Williams wrote: > On Fri, Dec 20, 2019 at 4:41 PM John Hubbard <jhubbard@nvidia.com> wrote: >> >> On 12/20/19 4:33 PM, Dan Williams wrote: >> ... >>>> I believe there might be also a different solution for this: For >>>> transparent huge pages, we could find a space in 'struct page' of the >>>> second page in the huge page for proper pin counter and just account pins >>>> there so we'd have full width of 32-bits for it. >>> >>> That would require THP accounting for dax pages. It is something that >>> was probably going to be needed, but this would seem to force the >>> issue. >>> >> >> Thanks for mentioning that, it wasn't obvious to me yet. >> >> How easy is it for mere mortals outside of Intel, to set up a DAX (nvdimm?) >> test setup? I'd hate to go into this without having that coverage up >> and running. It's been sketchy enough as it is. :) > > You too can have the power of the gods for the low low price of a > kernel command line parameter, or a qemu setup. > > Details here: > > https://nvdimm.wiki.kernel.org/how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system > https://nvdimm.wiki.kernel.org/pmem_in_qemu > Sweeeet! Now I can really cause some damage. :) thanks,
On Fri, Dec 20, 2019 at 03:54:55PM -0800, John Hubbard wrote: > On 12/20/19 10:29 AM, Leon Romanovsky wrote: > ... > >> $ ./build.sh > >> $ build/bin/run_tests.py > >> > >> If you get things that far I think Leon can get a reproduction for you > > > > I'm not so optimistic about that. > > > > OK, I'm going to proceed for now on the assumption that I've got an overflow > problem that happens when huge pages are pinned. If I can get more information, > great, otherwise it's probably enough. > > One thing: for your repro, if you know the huge page size, and the system > page size for that case, that would really help. Also the number of pins per > page, more or less, that you'd expect. Because Jason says that only 2M huge > pages are used... > > Because the other possibility is that the refcount really is going negative, > likely due to a mismatched pin/unpin somehow. > > If there's not an obvious repro case available, but you do have one (is it easy > to repro, though?), then *if* you have the time, I could point you to a github > branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this: I'll see what I can do this Sunday. > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index bb44c4d2ada7..8526fd03b978 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1077,7 +1077,7 @@ static inline void put_page(struct page *page) > * get_user_pages and page_mkclean and other calls that race to set up page > * table entries. > */ > -#define GUP_PIN_COUNTING_BIAS (1U << 10) > +#define GUP_PIN_COUNTING_BIAS (1U << 8) > > void unpin_user_page(struct page *page); > void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, > > If that fails to repro, then we would be zeroing in on the root cause. > > The branch is here (I just tested it and it seems healthy): > > git@github.com:johnhubbard/linux.git pin_user_pages_tracking_v11_with_diags > > > > thanks, > -- > John Hubbard > NVIDIA
On 12/21/19 2:08 AM, Leon Romanovsky wrote: > On Fri, Dec 20, 2019 at 03:54:55PM -0800, John Hubbard wrote: >> On 12/20/19 10:29 AM, Leon Romanovsky wrote: >> ... >>>> $ ./build.sh >>>> $ build/bin/run_tests.py >>>> >>>> If you get things that far I think Leon can get a reproduction for you >>> >>> I'm not so optimistic about that. >>> >> >> OK, I'm going to proceed for now on the assumption that I've got an overflow >> problem that happens when huge pages are pinned. If I can get more information, >> great, otherwise it's probably enough. >> >> One thing: for your repro, if you know the huge page size, and the system >> page size for that case, that would really help. Also the number of pins per >> page, more or less, that you'd expect. Because Jason says that only 2M huge >> pages are used... >> >> Because the other possibility is that the refcount really is going negative, >> likely due to a mismatched pin/unpin somehow. >> >> If there's not an obvious repro case available, but you do have one (is it easy >> to repro, though?), then *if* you have the time, I could point you to a github >> branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this: > > I'll see what I can do this Sunday. > The other data point that might shed light on whether it's a mismatch (this only works if the system is not actually crashing, though), is checking the new vmstat items, like this: $ grep foll_pin /proc/vmstat nr_foll_pin_requested 16288188 nr_foll_pin_returned 16288188 ...but OTOH, if you've got long-term pins, then those are *supposed* to be mismatched, so it only really helps in between tests. thanks,
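A trivial way to use those counters between runs, as suggested above (the file paths and the reproducer invocation are placeholders):

$ grep foll_pin /proc/vmstat > /tmp/foll_pin.before
  ... run the reproducer to completion ...
$ grep foll_pin /proc/vmstat > /tmp/foll_pin.after
$ diff /tmp/foll_pin.before /tmp/foll_pin.after

If every pin was matched by an unpin, nr_foll_pin_requested and nr_foll_pin_returned should have grown by the same amount between the two snapshots; a growing gap points at a pin/unpin mismatch rather than a refcount overflow.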
On Fri, Dec 20, 2019 at 03:54:55PM -0800, John Hubbard wrote: > On 12/20/19 10:29 AM, Leon Romanovsky wrote: > ... > >> $ ./build.sh > >> $ build/bin/run_tests.py > >> > >> If you get things that far I think Leon can get a reproduction for you > > > > I'm not so optimistic about that. > > > > OK, I'm going to proceed for now on the assumption that I've got an overflow > problem that happens when huge pages are pinned. If I can get more information, > great, otherwise it's probably enough. > > One thing: for your repro, if you know the huge page size, and the system > page size for that case, that would really help. Also the number of pins per > page, more or less, that you'd expect. Because Jason says that only 2M huge > pages are used... > > Because the other possibility is that the refcount really is going negative, > likely due to a mismatched pin/unpin somehow. > > If there's not an obvious repro case available, but you do have one (is it easy > to repro, though?), then *if* you have the time, I could point you to a github > branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this: > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index bb44c4d2ada7..8526fd03b978 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1077,7 +1077,7 @@ static inline void put_page(struct page *page) > * get_user_pages and page_mkclean and other calls that race to set up page > * table entries. > */ > -#define GUP_PIN_COUNTING_BIAS (1U << 10) > +#define GUP_PIN_COUNTING_BIAS (1U << 8) > > void unpin_user_page(struct page *page); > void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, > > If that fails to repro, then we would be zeroing in on the root cause. > > The branch is here (I just tested it and it seems healthy): > > git@github.com:johnhubbard/linux.git pin_user_pages_tracking_v11_with_diags Hi, We tested the following branch and here comes results: [root@server consume_mtts]# (master) $ grep foll_pin /proc/vmstat nr_foll_pin_requested 0 nr_foll_pin_returned 0 [root@serer consume_mtts]# (master) $ dmesg [ 425.221459] ------------[ cut here ]------------ [ 425.225894] WARNING: CPU: 1 PID: 6738 at mm/gup.c:61 try_grab_compound_head+0x90/0xa0 [ 425.228021] Modules linked in: mlx5_ib mlx5_core mlxfw mlx4_ib mlx4_en ptp pps_core mlx4_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre ip_tunnel rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm ib_uverbs ib_ipoib ib_umad ib_srp scsi_transport_srp rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core [last unloaded: mlxfw] [ 425.235266] CPU: 1 PID: 6738 Comm: consume_mtts Tainted: G O 5.5.0-rc2+ #1 [ 425.237480] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 425.239738] RIP: 0010:try_grab_compound_head+0x90/0xa0 [ 425.241170] Code: 06 48 8d 4f 34 f0 0f b1 57 34 74 cd 85 c0 74 cf 8d 14 06 f0 0f b1 11 74 c0 eb f1 8d 14 06 f0 0f b1 11 74 b5 85 c0 75 f3 eb b5 <0f> 0b 31 c0 c3 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 [ 425.245739] RSP: 0018:ffffc900006878a8 EFLAGS: 00010082 [ 425.247124] RAX: 0000000080000001 RBX: 00007f780488a000 RCX: 0000000000000bb0 [ 425.248956] RDX: ffffea000e031087 RSI: 0000000000008a00 RDI: ffffea000dc58000 [ 425.250761] RBP: ffffea000e031080 R08: ffffc90000687974 R09: 000fffffffe00000 [ 425.252661] R10: 0000000000000000 R11: ffff888362560000 R12: 000000000000008a [ 425.254487] R13: 80000003716000e7 R14: 00007f780488a000 R15: ffffc90000687974 [ 425.256309] FS: 00007f780d9d3740(0000) GS:ffff8883b1c80000(0000) 
knlGS:0000000000000000 [ 425.258401] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 425.259949] CR2: 0000000002334048 CR3: 000000039c68c001 CR4: 00000000001606a0 [ 425.261884] Call Trace: [ 425.262735] gup_pgd_range+0x517/0x5a0 [ 425.263819] internal_get_user_pages_fast+0x210/0x250 [ 425.265193] ib_umem_get+0x298/0x550 [ib_uverbs] [ 425.266476] mr_umem_get+0xc9/0x260 [mlx5_ib] [ 425.267699] mlx5_ib_reg_user_mr+0xcc/0x7e0 [mlx5_ib] [ 425.269134] ? xas_load+0x8/0x80 [ 425.270074] ? xa_load+0x48/0x90 [ 425.271038] ? lookup_get_idr_uobject.part.10+0x12/0x70 [ib_uverbs] [ 425.272757] ib_uverbs_reg_mr+0x127/0x280 [ib_uverbs] [ 425.274120] ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0xf0 [ib_uverbs] [ 425.276058] ib_uverbs_cmd_verbs.isra.6+0x5be/0xbe0 [ib_uverbs] [ 425.277657] ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs] [ 425.279155] ? __alloc_pages_nodemask+0x148/0x2b0 [ 425.280445] ib_uverbs_ioctl+0xc0/0x120 [ib_uverbs] [ 425.281755] do_vfs_ioctl+0x9d/0x650 [ 425.282766] ksys_ioctl+0x70/0x80 [ 425.283745] __x64_sys_ioctl+0x16/0x20 [ 425.284912] do_syscall_64+0x42/0x130 [ 425.285973] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 425.287377] RIP: 0033:0x7f780d2df267 [ 425.288449] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48 [ 425.293073] RSP: 002b:00007ffce49a88a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 425.295034] RAX: ffffffffffffffda RBX: 00007ffce49a8938 RCX: 00007f780d2df267 [ 425.296895] RDX: 00007ffce49a8920 RSI: 00000000c0181b01 RDI: 0000000000000003 [ 425.298689] RBP: 00007ffce49a8900 R08: 0000000000000003 R09: 00007f780d9a1010 [ 425.300480] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f780d9a1150 [ 425.302290] R13: 00007ffce49a8900 R14: 00007ffce49a8ad8 R15: 00007f780468a000 [ 425.304113] ---[ end trace 1ecbefdb403190dd ]--- [ 425.305434] ------------[ cut here ]------------ [ 425.307147] WARNING: CPU: 1 PID: 6738 at mm/gup.c:150 try_grab_page+0x56/0x60 [ 425.309111] Modules linked in: mlx5_ib mlx5_core mlxfw mlx4_ib mlx4_en ptp pps_core mlx4_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre ip_tunnel rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm ib_uverbs ib_ipoib ib_umad ib_srp scsi_transport_srp rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core [last unloaded: mlxfw] [ 425.316461] CPU: 1 PID: 6738 Comm: consume_mtts Tainted: G W O 5.5.0-rc2+ #1 [ 425.318582] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 425.320958] RIP: 0010:try_grab_page+0x56/0x60 [ 425.322167] Code: 7e 28 f0 81 47 34 00 01 00 00 c3 48 8b 47 08 48 8d 50 ff a8 01 48 0f 45 fa 8b 47 34 85 c0 7e 0f f0 ff 47 34 b8 01 00 00 00 c3 <0f> 0b 31 c0 c3 0f 0b 31 c0 c3 0f 1f 44 00 00 41 57 41 56 41 55 41 [ 425.326814] RSP: 0018:ffffc90000687830 EFLAGS: 00010282 [ 425.328226] RAX: 0000000000000001 RBX: ffffea000dc58000 RCX: ffffea000e031087 [ 425.330104] RDX: 0000000080000001 RSI: 0000000000040000 RDI: ffffea000dc58000 [ 425.331980] RBP: 00007f7804800000 R08: 000ffffffffff000 R09: 80000003716000e7 [ 425.333898] R10: ffff88834af80120 R11: ffff8883ac16f000 R12: ffff88834af80120 [ 425.335704] R13: ffff88837c0915c0 R14: 0000000000050201 R15: 00007f7804800000 [ 425.337638] FS: 00007f780d9d3740(0000) GS:ffff8883b1c80000(0000) knlGS:0000000000000000 [ 425.339734] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 425.341369] CR2: 0000000002334048 CR3: 000000039c68c001 CR4: 00000000001606a0 [ 
425.343160] Call Trace: [ 425.343967] follow_trans_huge_pmd+0x16f/0x2e0 [ 425.345263] follow_p4d_mask+0x51c/0x630 [ 425.346344] __get_user_pages+0x1a1/0x6c0 [ 425.347463] internal_get_user_pages_fast+0x17b/0x250 [ 425.348918] ib_umem_get+0x298/0x550 [ib_uverbs] [ 425.350174] mr_umem_get+0xc9/0x260 [mlx5_ib] [ 425.351383] mlx5_ib_reg_user_mr+0xcc/0x7e0 [mlx5_ib] [ 425.352849] ? xas_load+0x8/0x80 [ 425.353776] ? xa_load+0x48/0x90 [ 425.354730] ? lookup_get_idr_uobject.part.10+0x12/0x70 [ib_uverbs] [ 425.356410] ib_uverbs_reg_mr+0x127/0x280 [ib_uverbs] [ 425.357843] ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0xf0 [ib_uverbs] [ 425.359749] ib_uverbs_cmd_verbs.isra.6+0x5be/0xbe0 [ib_uverbs] [ 425.361405] ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs] [ 425.362898] ? __alloc_pages_nodemask+0x148/0x2b0 [ 425.364206] ib_uverbs_ioctl+0xc0/0x120 [ib_uverbs] [ 425.365564] do_vfs_ioctl+0x9d/0x650 [ 425.366567] ksys_ioctl+0x70/0x80 [ 425.367537] __x64_sys_ioctl+0x16/0x20 [ 425.368698] do_syscall_64+0x42/0x130 [ 425.369782] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 425.371117] RIP: 0033:0x7f780d2df267 [ 425.372159] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48 [ 425.376774] RSP: 002b:00007ffce49a88a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 425.378740] RAX: ffffffffffffffda RBX: 00007ffce49a8938 RCX: 00007f780d2df267 [ 425.380598] RDX: 00007ffce49a8920 RSI: 00000000c0181b01 RDI: 0000000000000003 [ 425.382411] RBP: 00007ffce49a8900 R08: 0000000000000003 R09: 00007f780d9a1010 [ 425.384312] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f780d9a1150 [ 425.386132] R13: 00007ffce49a8900 R14: 00007ffce49a8ad8 R15: 00007f780468a000 [ 425.387964] ---[ end trace 1ecbefdb403190de ]--- Thanks > > > > thanks, > -- > John Hubbard > NVIDIA
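Decoding the dump a little: both warnings come from the new grab helpers, and if RAX in the register dump really is the head page's _refcount (an assumption, but the value fits the pattern), 0x80000001 is what a 32-bit counter reads immediately after being pushed past 2^31, i.e. a large negative number when interpreted as a signed int. A minimal user-space model of that kind of check, not the kernel code:

/* Minimal model (not the upstream code) of the sanity check both warnings
 * appear to come from: refuse to take another reference once the 32-bit
 * refcount no longer reads as a sane positive value.
 */
#include <stdio.h>

static int try_grab_model(int refcount)
{
        if (refcount <= 0) {
                /* stands in for the WARN_ON_ONCE() at mm/gup.c:61 / :150 */
                fprintf(stderr, "refusing grab: refcount %d (0x%08x)\n",
                        refcount, (unsigned int)refcount);
                return 0;
        }
        return 1;
}

int main(void)
{
        /* the RAX value from the dump, reinterpreted as a signed 32-bit int */
        return try_grab_model((int)0x80000001u) ? 0 : 1;
}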
On Fri, Dec 20, 2019 at 04:32:13PM -0800, Dan Williams wrote: > > > There's already a limit, it's just a much larger one. :) What does "no limit" > > > really mean, numerically, to you in this case? > > > > I guess I mean 'hidden limit' - hitting the limit and failing would > > be manageable. > > > > I think 7 is probably too low though, but we are not using 1GB huge > > pages, only 2M.. > > What about RDMA to 1GB-hugetlbfs and 1GB-device-dax mappings? I don't think the failing test is doing that. It is also less likely that 1GB regions will need multi-mapping, IMHO. Jason
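The page-size point matters for the arithmetic: under the same illustrative assumption that a full pin adds subpages * GUP_PIN_COUNTING_BIAS to the head page's 32-bit refcount, a 1 GB mapping leaves room for only about eight concurrent full pins at the current bias, which is roughly where the "7 is probably too low" figure above appears to come from:

/* Why 1 GB mappings change the picture (illustrative only; same assumption
 * as before that one pin adds subpages * GUP_PIN_COUNTING_BIAS, 1 << 10,
 * to the head page's 32-bit refcount).
 */
#include <stdio.h>

int main(void)
{
        const long long bias = 1 << 10;
        const long long two_mb = 512;           /* 2 MB in 4 KB pages */
        const long long one_gb = 512 * 512;     /* 1 GB in 4 KB pages */

        printf("2M: ~%lld concurrent full-page pins before overflow\n",
               (1LL << 31) / (two_mb * bias));
        printf("1G: ~%lld concurrent full-page pins before overflow\n",
               (1LL << 31) / (one_gb * bias));
        return 0;
}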
On 12/22/19 5:23 AM, Leon Romanovsky wrote: > On Fri, Dec 20, 2019 at 03:54:55PM -0800, John Hubbard wrote: >> On 12/20/19 10:29 AM, Leon Romanovsky wrote: >> ... >>>> $ ./build.sh >>>> $ build/bin/run_tests.py >>>> >>>> If you get things that far I think Leon can get a reproduction for you >>> >>> I'm not so optimistic about that. >>> >> >> OK, I'm going to proceed for now on the assumption that I've got an overflow >> problem that happens when huge pages are pinned. If I can get more information, >> great, otherwise it's probably enough. >> >> One thing: for your repro, if you know the huge page size, and the system >> page size for that case, that would really help. Also the number of pins per >> page, more or less, that you'd expect. Because Jason says that only 2M huge >> pages are used... >> >> Because the other possibility is that the refcount really is going negative, >> likely due to a mismatched pin/unpin somehow. >> >> If there's not an obvious repro case available, but you do have one (is it easy >> to repro, though?), then *if* you have the time, I could point you to a github >> branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this: >> >> diff --git a/include/linux/mm.h b/include/linux/mm.h >> index bb44c4d2ada7..8526fd03b978 100644 >> --- a/include/linux/mm.h >> +++ b/include/linux/mm.h >> @@ -1077,7 +1077,7 @@ static inline void put_page(struct page *page) >> * get_user_pages and page_mkclean and other calls that race to set up page >> * table entries. >> */ >> -#define GUP_PIN_COUNTING_BIAS (1U << 10) >> +#define GUP_PIN_COUNTING_BIAS (1U << 8) >> >> void unpin_user_page(struct page *page); >> void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, >> >> If that fails to repro, then we would be zeroing in on the root cause. >> >> The branch is here (I just tested it and it seems healthy): >> >> git@github.com:johnhubbard/linux.git pin_user_pages_tracking_v11_with_diags > > Hi, > > We tested the following branch and here comes results: Thanks for this testing run! > [root@server consume_mtts]# (master) $ grep foll_pin /proc/vmstat > nr_foll_pin_requested 0 > nr_foll_pin_returned 0 > Zero pinned pages! ...now I'm confused. Somehow FOLL_PIN and pin_user_pages*() calls are not happening. And although the backtraces below show some of my new routines (like try_grab_page), they also confirm the above: there is no pin_user_page*() call in the stack. In particular, it looks like ib_umem_get() is calling through to get_user_pages*(), rather than pin_user_pages*(). I don't see how this is possible, because the code on my screen shows ib_umem_get() calling pin_user_pages_fast(). Any thoughts or ideas are welcome here. However, glossing over all of that and assuming that the new GUP_PIN_COUNTING_BIAS of 256 is applied, it's interesting that we still see any overflow. I'm less confident now that this is a true refcount overflow. Also, any information that would get me closer to being able to attempt my own reproduction of the problem are *very* welcome. :) thanks,
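Two quick sanity checks on that doubt, both resting on the same illustrative assumptions as before (2 MB huge pages, bias applied per subpage, and RAX in the dumps being the refcount): even at a bias of 256 it takes only about 16K repeated full pins of the same head page to cross 2^31, which a workload that pins one buffer over and over could plausibly reach; and a genuine underflow would leave a value just below zero (0xffffffxx), not the 0x80000001 seen in the dumps:

/* Two quick checks on the "still overflows at BIAS = 256" observation.
 * Assumes 2 MB huge pages (512 subpages) with the bias applied per subpage.
 */
#include <stdio.h>

int main(void)
{
        const unsigned long long per_pin = 512ULL * 256;   /* one full-page pin at BIAS 256 */
        unsigned int underflow = 10u - 256u;                /* small refcount minus one unpin */

        /* ~16K repeated pins of the same head page would still cross 2^31 */
        printf("pins to overflow: %llu\n", (1ULL << 31) / per_pin);

        /* a true underflow would leave 0xffffffxx, not the 0x80000001 in the dumps */
        printf("underflow would look like: 0x%08x\n", underflow);
        return 0;
}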
On Tue, Dec 24, 2019 at 06:03:50PM -0800, John Hubbard wrote: > On 12/22/19 5:23 AM, Leon Romanovsky wrote: > > On Fri, Dec 20, 2019 at 03:54:55PM -0800, John Hubbard wrote: > > > On 12/20/19 10:29 AM, Leon Romanovsky wrote: > > > ... > > > > > $ ./build.sh > > > > > $ build/bin/run_tests.py > > > > > > > > > > If you get things that far I think Leon can get a reproduction for you > > > > > > > > I'm not so optimistic about that. > > > > > > > > > > OK, I'm going to proceed for now on the assumption that I've got an overflow > > > problem that happens when huge pages are pinned. If I can get more information, > > > great, otherwise it's probably enough. > > > > > > One thing: for your repro, if you know the huge page size, and the system > > > page size for that case, that would really help. Also the number of pins per > > > page, more or less, that you'd expect. Because Jason says that only 2M huge > > > pages are used... > > > > > > Because the other possibility is that the refcount really is going negative, > > > likely due to a mismatched pin/unpin somehow. > > > > > > If there's not an obvious repro case available, but you do have one (is it easy > > > to repro, though?), then *if* you have the time, I could point you to a github > > > branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this: > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > index bb44c4d2ada7..8526fd03b978 100644 > > > --- a/include/linux/mm.h > > > +++ b/include/linux/mm.h > > > @@ -1077,7 +1077,7 @@ static inline void put_page(struct page *page) > > > * get_user_pages and page_mkclean and other calls that race to set up page > > > * table entries. > > > */ > > > -#define GUP_PIN_COUNTING_BIAS (1U << 10) > > > +#define GUP_PIN_COUNTING_BIAS (1U << 8) > > > > > > void unpin_user_page(struct page *page); > > > void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, > > > > > > If that fails to repro, then we would be zeroing in on the root cause. > > > > > > The branch is here (I just tested it and it seems healthy): > > > > > > git@github.com:johnhubbard/linux.git pin_user_pages_tracking_v11_with_diags > > > > Hi, > > > > We tested the following branch and here comes results: > > Thanks for this testing run! > > > [root@server consume_mtts]# (master) $ grep foll_pin /proc/vmstat > > nr_foll_pin_requested 0 > > nr_foll_pin_returned 0 > > > > Zero pinned pages! Maybe we are missing some CONFIG_* option? https://lore.kernel.org/linux-rdma/12a28917-f8c9-5092-2f01-92bb74714cae@nvidia.com/T/#mf900896f5dfc86cdee9246219990c632ed77115f > > ...now I'm confused. Somehow FOLL_PIN and pin_user_pages*() calls are > not happening. And although the backtraces below show some of my new > routines (like try_grab_page), they also confirm the above: there is no > pin_user_page*() call in the stack. > > In particular, it looks like ib_umem_get() is calling through to > get_user_pages*(), rather than pin_user_pages*(). I don't see how this > is possible, because the code on my screen shows ib_umem_get() calling > pin_user_pages_fast(). > > Any thoughts or ideas are welcome here. > > However, glossing over all of that and assuming that the new > GUP_PIN_COUNTING_BIAS of 256 is applied, it's interesting that we still > see any overflow. I'm less confident now that this is a true refcount > overflow. Earlier in this email thread, I posted possible function call chain which doesn't involve refcount overflow, but for some reason the refcount overflow was chosen as a way to explore. 
> > Also, any information that would get me closer to being able to attempt > my own reproduction of the problem are *very* welcome. :) It is ancient verification test (~10y) which is not an easy task to make it understandable and standalone :). > > thanks, > -- > John Hubbard > NVIDIA > > > [root@serer consume_mtts]# (master) $ dmesg > > [ 425.221459] ------------[ cut here ]------------ > > [ 425.225894] WARNING: CPU: 1 PID: 6738 at mm/gup.c:61 try_grab_compound_head+0x90/0xa0 > > [ 425.228021] Modules linked in: mlx5_ib mlx5_core mlxfw mlx4_ib mlx4_en ptp pps_core mlx4_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre ip_tunnel rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm ib_uverbs ib_ipoib ib_umad ib_srp scsi_transport_srp rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core [last unloaded: mlxfw] > > [ 425.235266] CPU: 1 PID: 6738 Comm: consume_mtts Tainted: G O 5.5.0-rc2+ #1 > > [ 425.237480] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 > > [ 425.239738] RIP: 0010:try_grab_compound_head+0x90/0xa0 > > [ 425.241170] Code: 06 48 8d 4f 34 f0 0f b1 57 34 74 cd 85 c0 74 cf 8d 14 06 f0 0f b1 11 74 c0 eb f1 8d 14 06 f0 0f b1 11 74 b5 85 c0 75 f3 eb b5 <0f> 0b 31 c0 c3 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 > > [ 425.245739] RSP: 0018:ffffc900006878a8 EFLAGS: 00010082 > > [ 425.247124] RAX: 0000000080000001 RBX: 00007f780488a000 RCX: 0000000000000bb0 > > [ 425.248956] RDX: ffffea000e031087 RSI: 0000000000008a00 RDI: ffffea000dc58000 > > [ 425.250761] RBP: ffffea000e031080 R08: ffffc90000687974 R09: 000fffffffe00000 > > [ 425.252661] R10: 0000000000000000 R11: ffff888362560000 R12: 000000000000008a > > [ 425.254487] R13: 80000003716000e7 R14: 00007f780488a000 R15: ffffc90000687974 > > [ 425.256309] FS: 00007f780d9d3740(0000) GS:ffff8883b1c80000(0000) knlGS:0000000000000000 > > [ 425.258401] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 425.259949] CR2: 0000000002334048 CR3: 000000039c68c001 CR4: 00000000001606a0 > > [ 425.261884] Call Trace: > > [ 425.262735] gup_pgd_range+0x517/0x5a0 > > [ 425.263819] internal_get_user_pages_fast+0x210/0x250 > > [ 425.265193] ib_umem_get+0x298/0x550 [ib_uverbs] > > [ 425.266476] mr_umem_get+0xc9/0x260 [mlx5_ib] > > [ 425.267699] mlx5_ib_reg_user_mr+0xcc/0x7e0 [mlx5_ib] > > [ 425.269134] ? xas_load+0x8/0x80 > > [ 425.270074] ? xa_load+0x48/0x90 > > [ 425.271038] ? lookup_get_idr_uobject.part.10+0x12/0x70 [ib_uverbs] > > [ 425.272757] ib_uverbs_reg_mr+0x127/0x280 [ib_uverbs] > > [ 425.274120] ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0xf0 [ib_uverbs] > > [ 425.276058] ib_uverbs_cmd_verbs.isra.6+0x5be/0xbe0 [ib_uverbs] > > [ 425.277657] ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs] > > [ 425.279155] ? 
__alloc_pages_nodemask+0x148/0x2b0 > > [ 425.280445] ib_uverbs_ioctl+0xc0/0x120 [ib_uverbs] > > [ 425.281755] do_vfs_ioctl+0x9d/0x650 > > [ 425.282766] ksys_ioctl+0x70/0x80 > > [ 425.283745] __x64_sys_ioctl+0x16/0x20 > > [ 425.284912] do_syscall_64+0x42/0x130 > > [ 425.285973] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > [ 425.287377] RIP: 0033:0x7f780d2df267 > > [ 425.288449] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48 > > [ 425.293073] RSP: 002b:00007ffce49a88a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 > > [ 425.295034] RAX: ffffffffffffffda RBX: 00007ffce49a8938 RCX: 00007f780d2df267 > > [ 425.296895] RDX: 00007ffce49a8920 RSI: 00000000c0181b01 RDI: 0000000000000003 > > [ 425.298689] RBP: 00007ffce49a8900 R08: 0000000000000003 R09: 00007f780d9a1010 > > [ 425.300480] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f780d9a1150 > > [ 425.302290] R13: 00007ffce49a8900 R14: 00007ffce49a8ad8 R15: 00007f780468a000 > > [ 425.304113] ---[ end trace 1ecbefdb403190dd ]--- > > [ 425.305434] ------------[ cut here ]------------ > > [ 425.307147] WARNING: CPU: 1 PID: 6738 at mm/gup.c:150 try_grab_page+0x56/0x60 > > [ 425.309111] Modules linked in: mlx5_ib mlx5_core mlxfw mlx4_ib mlx4_en ptp pps_core mlx4_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre ip_tunnel rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm ib_uverbs ib_ipoib ib_umad ib_srp scsi_transport_srp rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core [last unloaded: mlxfw] > > [ 425.316461] CPU: 1 PID: 6738 Comm: consume_mtts Tainted: G W O 5.5.0-rc2+ #1 > > [ 425.318582] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 > > [ 425.320958] RIP: 0010:try_grab_page+0x56/0x60 > > [ 425.322167] Code: 7e 28 f0 81 47 34 00 01 00 00 c3 48 8b 47 08 48 8d 50 ff a8 01 48 0f 45 fa 8b 47 34 85 c0 7e 0f f0 ff 47 34 b8 01 00 00 00 c3 <0f> 0b 31 c0 c3 0f 0b 31 c0 c3 0f 1f 44 00 00 41 57 41 56 41 55 41 > > [ 425.326814] RSP: 0018:ffffc90000687830 EFLAGS: 00010282 > > [ 425.328226] RAX: 0000000000000001 RBX: ffffea000dc58000 RCX: ffffea000e031087 > > [ 425.330104] RDX: 0000000080000001 RSI: 0000000000040000 RDI: ffffea000dc58000 > > [ 425.331980] RBP: 00007f7804800000 R08: 000ffffffffff000 R09: 80000003716000e7 > > [ 425.333898] R10: ffff88834af80120 R11: ffff8883ac16f000 R12: ffff88834af80120 > > [ 425.335704] R13: ffff88837c0915c0 R14: 0000000000050201 R15: 00007f7804800000 > > [ 425.337638] FS: 00007f780d9d3740(0000) GS:ffff8883b1c80000(0000) knlGS:0000000000000000 > > [ 425.339734] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 425.341369] CR2: 0000000002334048 CR3: 000000039c68c001 CR4: 00000000001606a0 > > [ 425.343160] Call Trace: > > [ 425.343967] follow_trans_huge_pmd+0x16f/0x2e0 > > [ 425.345263] follow_p4d_mask+0x51c/0x630 > > [ 425.346344] __get_user_pages+0x1a1/0x6c0 > > [ 425.347463] internal_get_user_pages_fast+0x17b/0x250 > > [ 425.348918] ib_umem_get+0x298/0x550 [ib_uverbs] > > [ 425.350174] mr_umem_get+0xc9/0x260 [mlx5_ib] > > [ 425.351383] mlx5_ib_reg_user_mr+0xcc/0x7e0 [mlx5_ib] > > [ 425.352849] ? xas_load+0x8/0x80 > > [ 425.353776] ? xa_load+0x48/0x90 > > [ 425.354730] ? 
lookup_get_idr_uobject.part.10+0x12/0x70 [ib_uverbs] > > [ 425.356410] ib_uverbs_reg_mr+0x127/0x280 [ib_uverbs] > > [ 425.357843] ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0xf0 [ib_uverbs] > > [ 425.359749] ib_uverbs_cmd_verbs.isra.6+0x5be/0xbe0 [ib_uverbs] > > [ 425.361405] ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs] > > [ 425.362898] ? __alloc_pages_nodemask+0x148/0x2b0 > > [ 425.364206] ib_uverbs_ioctl+0xc0/0x120 [ib_uverbs] > > [ 425.365564] do_vfs_ioctl+0x9d/0x650 > > [ 425.366567] ksys_ioctl+0x70/0x80 > > [ 425.367537] __x64_sys_ioctl+0x16/0x20 > > [ 425.368698] do_syscall_64+0x42/0x130 > > [ 425.369782] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > [ 425.371117] RIP: 0033:0x7f780d2df267 > > [ 425.372159] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48 > > [ 425.376774] RSP: 002b:00007ffce49a88a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 > > [ 425.378740] RAX: ffffffffffffffda RBX: 00007ffce49a8938 RCX: 00007f780d2df267 > > [ 425.380598] RDX: 00007ffce49a8920 RSI: 00000000c0181b01 RDI: 0000000000000003 > > [ 425.382411] RBP: 00007ffce49a8900 R08: 0000000000000003 R09: 00007f780d9a1010 > > [ 425.384312] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f780d9a1150 > > [ 425.386132] R13: 00007ffce49a8900 R14: 00007ffce49a8ad8 R15: 00007f780468a000 > > [ 425.387964] ---[ end trace 1ecbefdb403190de ]--- > > > > Thanks > > > > > > > > > > > > > > thanks, > > > -- > > > John Hubbard > > > NVIDIA
On 12/24/19 9:26 PM, Leon Romanovsky wrote: ... >>>> The branch is here (I just tested it and it seems healthy): >>>> >>>> git@github.com:johnhubbard/linux.git pin_user_pages_tracking_v11_with_diags >>> >>> Hi, >>> >>> We tested the following branch and here comes results: >> >> Thanks for this testing run! >> >>> [root@server consume_mtts]# (master) $ grep foll_pin /proc/vmstat >>> nr_foll_pin_requested 0 >>> nr_foll_pin_returned 0 >>> >> >> Zero pinned pages! > > Maybe we are missing some CONFIG_* option? > https://lore.kernel.org/linux-rdma/12a28917-f8c9-5092-2f01-92bb74714cae@nvidia.com/T/#mf900896f5dfc86cdee9246219990c632ed77115f Ah OK, it must be that CONFIG_DEBUG_VM is not set, thanks! >> ...now I'm confused. Somehow FOLL_PIN and pin_user_pages*() calls are >> not happening. And although the backtraces below show some of my new >> routines (like try_grab_page), they also confirm the above: there is no >> pin_user_page*() call in the stack. >> >> In particular, it looks like ib_umem_get() is calling through to >> get_user_pages*(), rather than pin_user_pages*(). I don't see how this >> is possible, because the code on my screen shows ib_umem_get() calling >> pin_user_pages_fast(). >> It must be that pin_user_pages() is in the call stack, but just not getting printed. There's no other way to explain this. >> Any thoughts or ideas are welcome here. >> >> However, glossing over all of that and assuming that the new >> GUP_PIN_COUNTING_BIAS of 256 is applied, it's interesting that we still >> see any overflow. I'm less confident now that this is a true refcount >> overflow. > > Earlier in this email thread, I posted possible function call chain which > doesn't involve refcount overflow, but for some reason the refcount > overflow was chosen as a way to explore. > Well, both of the WARN() calls are asserting that the refcount went negative (well, one asserts negative, and the other asserts "<=0"). So it's pretty hard to talk our way out of a refcount overflow here. >> >> Also, any information that would get me closer to being able to attempt >> my own reproduction of the problem are *very* welcome. :) > > It is ancient verification test (~10y) which is not an easy task to > make it understandable and standalone :). > Is this the only test that fails, btw? No other test failures or hints of problems? (Also, maybe hopeless, but can *anyone* on the RDMA list provide some characterization of the test, such as how many pins per page, what page sizes are used? I'm still hoping to write a test to trigger something close to this...) I do have a couple more ideas for test runs: 1. Reduce GUP_PIN_COUNTING_BIAS to 1. That would turn the whole override of page->_refcount into a no-op, and so if all is well (it may not be!) with the rest of the patch, then we'd expect this problem to not reappear. 2. Active /proc/vmstat *foll_pin* statistics unconditionally (just for these tests, of course), so we can see if there is a get/put mismatch. However, that will change the timing, and so it must be attempted independently of (1), in order to see if it ends up hiding the repro. I've updated this branch to implement (1), but not (2), hoping you can give this one a spin? git@github.com:johnhubbard/linux.git pin_user_pages_tracking_v11_with_diags thanks,
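For anyone else puzzled by the zero counters: if the nr_foll_pin_* accounting is only compiled in under CONFIG_DEBUG_VM, as guessed above, then a non-debug config reports zeros even while pin_user_pages*() is being used. The sketch below shows that gating pattern; the helper name and the NR_FOLL_PIN_REQUESTED item are placeholders inferred from the vmstat names, not the actual patch:

#include <linux/mm.h>
#include <linux/vmstat.h>

/*
 * Sketch of the config-gated accounting pattern under discussion; the
 * helper and the NR_FOLL_PIN_REQUESTED item are illustrative names only.
 * With CONFIG_DEBUG_VM=n this compiles away, so /proc/vmstat keeps
 * reporting zero even though pages are being pinned.
 */
#ifdef CONFIG_DEBUG_VM
static void count_pin_requested(struct page *page)
{
        mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_REQUESTED, 1);
}
#else
static inline void count_pin_requested(struct page *page) { }
#endif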
On 12/27/19 1:56 PM, John Hubbard wrote: ... >> It is ancient verification test (~10y) which is not an easy task to >> make it understandable and standalone :). >> > > Is this the only test that fails, btw? No other test failures or hints of > problems? > > (Also, maybe hopeless, but can *anyone* on the RDMA list provide some > characterization of the test, such as how many pins per page, what page > sizes are used? I'm still hoping to write a test to trigger something > close to this...) > > I do have a couple more ideas for test runs: > > 1. Reduce GUP_PIN_COUNTING_BIAS to 1. That would turn the whole override of > page->_refcount into a no-op, and so if all is well (it may not be!) with the > rest of the patch, then we'd expect this problem to not reappear. > > 2. Active /proc/vmstat *foll_pin* statistics unconditionally (just for these > tests, of course), so we can see if there is a get/put mismatch. However, that > will change the timing, and so it must be attempted independently of (1), in > order to see if it ends up hiding the repro. > > I've updated this branch to implement (1), but not (2), hoping you can give > this one a spin? > > git@github.com:johnhubbard/linux.git pin_user_pages_tracking_v11_with_diags > > Also, looking ahead: a) if the problem disappears with the latest above test, then we likely have a huge page refcount overflow, and there are a couple of different ways to fix it. b) if it still reproduces with the above, then it's some other random mistake, and in that case I'd be inclined to do a sort of guided (or classic, unguided) git bisect of the series. Because it could be any of several patches. If that's too much trouble, then I'd have to fall back to submitting a few patches at a time and working my way up to the tracking patch... thanks,
On Sat 28-12-19 20:33:32, John Hubbard wrote: > On 12/27/19 1:56 PM, John Hubbard wrote: > ... > >> It is ancient verification test (~10y) which is not an easy task to > >> make it understandable and standalone :). > >> > > > > Is this the only test that fails, btw? No other test failures or hints of > > problems? > > > > (Also, maybe hopeless, but can *anyone* on the RDMA list provide some > > characterization of the test, such as how many pins per page, what page > > sizes are used? I'm still hoping to write a test to trigger something > > close to this...) > > > > I do have a couple more ideas for test runs: > > > > 1. Reduce GUP_PIN_COUNTING_BIAS to 1. That would turn the whole override of > > page->_refcount into a no-op, and so if all is well (it may not be!) with the > > rest of the patch, then we'd expect this problem to not reappear. > > > > 2. Active /proc/vmstat *foll_pin* statistics unconditionally (just for these > > tests, of course), so we can see if there is a get/put mismatch. However, that > > will change the timing, and so it must be attempted independently of (1), in > > order to see if it ends up hiding the repro. > > > > I've updated this branch to implement (1), but not (2), hoping you can give > > this one a spin? > > > > git@github.com:johnhubbard/linux.git pin_user_pages_tracking_v11_with_diags > > > > > > Also, looking ahead: > > a) if the problem disappears with the latest above test, then we likely have > a huge page refcount overflow, and there are a couple of different ways to > fix it. > > b) if it still reproduces with the above, then it's some other random mistake, > and in that case I'd be inclined to do a sort of guided (or classic, unguided) > git bisect of the series. Because it could be any of several patches. > > If that's too much trouble, then I'd have to fall back to submitting a few > patches at a time and working my way up to the tracking patch... It could also be that an ordinary page reference is dropped with 'unpin' thus underflowing the page refcount... Honza
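Jan's scenario is easy to model: if a reference taken with plain get_user_pages() (which adds 1) is ever released through unpin_user_page() (which subtracts GUP_PIN_COUNTING_BIAS), each such cycle leaks bias-minus-one references and the count goes negative almost immediately. This is a toy illustration of the mismatch only, not a claim about where such a pairing might exist:

/* Toy model of the get/unpin mismatch Jan describes. */
#include <stdio.h>

#define GUP_PIN_COUNTING_BIAS (1 << 10)         /* mirrors the kernel's 1U << 10 */

int main(void)
{
        int refcount = 1;       /* pretend the page starts with one reference held */

        for (int i = 1; i <= 3; i++) {
                refcount += 1;                          /* get_user_pages() takes +1        */
                refcount -= GUP_PIN_COUNTING_BIAS;      /* unpin_user_page() drops the bias */
                printf("after mismatch cycle %d: refcount %d\n", i, refcount);
        }
        return 0;
}

Notably, this failure mode would tend to leave small negative refcounts rather than values just past 2^31, which is one way to tell the two apart in future dumps.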
On 1/6/20 1:01 AM, Jan Kara wrote: ... >> Also, looking ahead: >> >> a) if the problem disappears with the latest above test, then we likely have >> a huge page refcount overflow, and there are a couple of different ways to >> fix it. >> >> b) if it still reproduces with the above, then it's some other random mistake, >> and in that case I'd be inclined to do a sort of guided (or classic, unguided) >> git bisect of the series. Because it could be any of several patches. >> >> If that's too much trouble, then I'd have to fall back to submitting a few >> patches at a time and working my way up to the tracking patch... > > It could also be that an ordinary page reference is dropped with 'unpin' > thus underflowing the page refcount... > > Honza > Yes. And I think I'm about out of time for this release cycle, so I'm probably going to submit the prerequisite patches (patches 1-10, or more boldly, 1-22) as candidates for 5.6. thanks,