mbox series

[v2,0/6] mm/devm_memremap_pages: Fix page release race

Message ID 155727335978.292046.12068191395005445711.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
Headers show
Series mm/devm_memremap_pages: Fix page release race | expand

Message

Dan Williams May 7, 2019, 11:55 p.m. UTC
Changes since v1 [1]:
- Fix a NULL-pointer deref crash in pci_p2pdma_release() (Logan)

- Refresh the p2pdma patch headers to match the format of other p2pdma
  patches (Bjorn)

- Collect Ira's reviewed-by

[1]: https://lore.kernel.org/lkml/155387324370.2443841.574715745262628837.stgit@dwillia2-desk3.amr.corp.intel.com/

---

Logan audited the devm_memremap_pages() shutdown path and noticed that
it was possible to proceed to arch_remove_memory() before all
potential page references have been reaped.

Introduce a new ->cleanup() callback to do the work of waiting for any
straggling page references and then perform the percpu_ref_exit() in
devm_memremap_pages_release() context.

For p2pdma this involves some deeper reworks to reference count
resources on a per-instance basis rather than a per pci-device basis. A
modified genalloc api is introduced to convey a driver-private pointer
through gen_pool_{alloc,free}() interfaces. Also, a
devm_memunmap_pages() api is introduced since p2pdma does not
auto-release resources on a setup failure.

The dax and pmem changes pass the nvdimm unit tests, and the p2pdma
changes should now pass testing with the pci_p2pdma_release() fix.
Jérôme, how does this look for HMM?

In general, I think these patches / fixes are suitable for v5.2-rc1 or
v5.2-rc2, and since they touch kernel/memremap.c, and other various
pieces of the core, they should go through the -mm tree. These patches
merge cleanly with the current state of -next, pass the nvdimm unit
tests, and are exposed to the 0day robot with no issues reported
(https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=libnvdimm-pending).

---

Dan Williams (6):
      drivers/base/devres: Introduce devm_release_action()
      mm/devm_memremap_pages: Introduce devm_memunmap_pages
      PCI/P2PDMA: Fix the gen_pool_add_virt() failure path
      lib/genalloc: Introduce chunk owners
      PCI/P2PDMA: Track pgmap references per resource, not globally
      mm/devm_memremap_pages: Fix final page put race


 drivers/base/devres.c             |   24 +++++++-
 drivers/dax/device.c              |   13 +---
 drivers/nvdimm/pmem.c             |   17 ++++-
 drivers/pci/p2pdma.c              |  115 +++++++++++++++++++++++--------------
 include/linux/device.h            |    1 
 include/linux/genalloc.h          |   55 ++++++++++++++++--
 include/linux/memremap.h          |    8 +++
 kernel/memremap.c                 |   23 ++++++-
 lib/genalloc.c                    |   51 ++++++++--------
 mm/hmm.c                          |   14 +----
 tools/testing/nvdimm/test/iomap.c |    2 +
 11 files changed, 217 insertions(+), 106 deletions(-)

Comments

Logan Gunthorpe May 8, 2019, 5:05 p.m. UTC | #1
On 2019-05-07 5:55 p.m., Dan Williams wrote:
> Changes since v1 [1]:
> - Fix a NULL-pointer deref crash in pci_p2pdma_release() (Logan)
> 
> - Refresh the p2pdma patch headers to match the format of other p2pdma
>    patches (Bjorn)
> 
> - Collect Ira's reviewed-by
> 
> [1]: https://lore.kernel.org/lkml/155387324370.2443841.574715745262628837.stgit@dwillia2-desk3.amr.corp.intel.com/

This series looks good to me:

Reviewed-by: Logan Gunthorpe <logang@deltatee.com>

However, I haven't tested it yet but I intend to later this week.

Thanks,

Logan
Logan Gunthorpe May 13, 2019, 7:22 p.m. UTC | #2
On 2019-05-08 11:05 a.m., Logan Gunthorpe wrote:
> 
> 
> On 2019-05-07 5:55 p.m., Dan Williams wrote:
>> Changes since v1 [1]:
>> - Fix a NULL-pointer deref crash in pci_p2pdma_release() (Logan)
>>
>> - Refresh the p2pdma patch headers to match the format of other p2pdma
>>    patches (Bjorn)
>>
>> - Collect Ira's reviewed-by
>>
>> [1]: https://lore.kernel.org/lkml/155387324370.2443841.574715745262628837.stgit@dwillia2-desk3.amr.corp.intel.com/
> 
> This series looks good to me:
> 
> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
> 
> However, I haven't tested it yet but I intend to later this week.

I've tested libnvdimm-pending which includes this series on my setup and
everything works great.

Thanks,

Logan
Jane Chu May 14, 2019, 6:51 p.m. UTC | #3
On 5/13/2019 12:22 PM, Logan Gunthorpe wrote:

>
> On 2019-05-08 11:05 a.m., Logan Gunthorpe wrote:
>>
>> On 2019-05-07 5:55 p.m., Dan Williams wrote:
>>> Changes since v1 [1]:
>>> - Fix a NULL-pointer deref crash in pci_p2pdma_release() (Logan)
>>>
>>> - Refresh the p2pdma patch headers to match the format of other p2pdma
>>>     patches (Bjorn)
>>>
>>> - Collect Ira's reviewed-by
>>>
>>> [1]: https://lore.kernel.org/lkml/155387324370.2443841.574715745262628837.stgit@dwillia2-desk3.amr.corp.intel.com/
>> This series looks good to me:
>>
>> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
>>
>> However, I haven't tested it yet but I intend to later this week.
> I've tested libnvdimm-pending which includes this series on my setup and
> everything works great.

Just wondering in a difference scenario where pmem pages are exported to
a KVM guest, and then by mistake the user issues "ndctl destroy-namespace -f",
will the kernel wait indefinitely until the user figures out to kill the guest
and release the pmem pages?

thanks,
-jane
  

>
> Thanks,
>
> Logan
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
Dan Williams May 14, 2019, 7:04 p.m. UTC | #4
On Tue, May 14, 2019 at 11:53 AM Jane Chu <jane.chu@oracle.com> wrote:
>
> On 5/13/2019 12:22 PM, Logan Gunthorpe wrote:
>
> On 2019-05-08 11:05 a.m., Logan Gunthorpe wrote:
>
> On 2019-05-07 5:55 p.m., Dan Williams wrote:
>
> Changes since v1 [1]:
> - Fix a NULL-pointer deref crash in pci_p2pdma_release() (Logan)
>
> - Refresh the p2pdma patch headers to match the format of other p2pdma
>    patches (Bjorn)
>
> - Collect Ira's reviewed-by
>
> [1]: https://lore.kernel.org/lkml/155387324370.2443841.574715745262628837.stgit@dwillia2-desk3.amr.corp.intel.com/
>
> This series looks good to me:
>
> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
>
> However, I haven't tested it yet but I intend to later this week.
>
> I've tested libnvdimm-pending which includes this series on my setup and
> everything works great.
>
> Just wondering in a difference scenario where pmem pages are exported to
> a KVM guest, and then by mistake the user issues "ndctl destroy-namespace -f",
> will the kernel wait indefinitely until the user figures out to kill the guest
> and release the pmem pages?

It depends on whether the pages are pinned. Typically DAX memory
mappings assigned to a guest are not pinned in the host and can be
invalidated at any time. The pinning only occurs with VFIO and
device-assignment which isn't the common case, especially since that
configuration is blocked by fsdax. However, with devdax, yes you can
arrange for the system to go into an indefinite wait.

This somewhat ties back to the get_user_pages() vs DAX debate. The
indefinite stall issue with device-assignment could be addressed with
a requirement to hold a lease and expect that a lease revocation event
may escalate to SIGKILL in response to 'ndctl destroy-namespace'. The
expectation with device-dax is that it is already a raw interface with
pointy edges and caveats, but I would not be opposed to introducing a
lease semantic.
Jane Chu May 14, 2019, 9:18 p.m. UTC | #5
On 5/14/2019 12:04 PM, Dan Williams wrote:

> On Tue, May 14, 2019 at 11:53 AM Jane Chu <jane.chu@oracle.com> wrote:
>> On 5/13/2019 12:22 PM, Logan Gunthorpe wrote:
>>
>> On 2019-05-08 11:05 a.m., Logan Gunthorpe wrote:
>>
>> On 2019-05-07 5:55 p.m., Dan Williams wrote:
>>
>> Changes since v1 [1]:
>> - Fix a NULL-pointer deref crash in pci_p2pdma_release() (Logan)
>>
>> - Refresh the p2pdma patch headers to match the format of other p2pdma
>>     patches (Bjorn)
>>
>> - Collect Ira's reviewed-by
>>
>> [1]: https://lore.kernel.org/lkml/155387324370.2443841.574715745262628837.stgit@dwillia2-desk3.amr.corp.intel.com/
>>
>> This series looks good to me:
>>
>> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
>>
>> However, I haven't tested it yet but I intend to later this week.
>>
>> I've tested libnvdimm-pending which includes this series on my setup and
>> everything works great.
>>
>> Just wondering in a difference scenario where pmem pages are exported to
>> a KVM guest, and then by mistake the user issues "ndctl destroy-namespace -f",
>> will the kernel wait indefinitely until the user figures out to kill the guest
>> and release the pmem pages?
> It depends on whether the pages are pinned. Typically DAX memory
> mappings assigned to a guest are not pinned in the host and can be
> invalidated at any time. The pinning only occurs with VFIO and
> device-assignment which isn't the common case, especially since that
> configuration is blocked by fsdax. However, with devdax, yes you can
> arrange for the system to go into an indefinite wait.
>
> This somewhat ties back to the get_user_pages() vs DAX debate. The
> indefinite stall issue with device-assignment could be addressed with
> a requirement to hold a lease and expect that a lease revocation event
> may escalate to SIGKILL in response to 'ndctl destroy-namespace'. The
> expectation with device-dax is that it is already a raw interface with
> pointy edges and caveats, but I would not be opposed to introducing a
> lease semantic.

Thanks for the quick response Dan.

I am not convinced that the get_user_pages() vs FS-DAX dilemma is a perfect
comparison to "ndctl destroy-namespace -f" vs namespace-is-busy dilemma.

Others might disagree with me, I thought that there is no risk of panic
if we fail "ndctl destroy-namespace -f" to honor a clean shutdown of the
user application. Also, both actions are on the same host, so in theory
the admin could shutdown the application before attempt a destructive
action.

By allowing 'opposite' actions in competition with each other at fine
granularity, there is potential for panic in general, not necessarily with
pinned page I guess.  I just ran an experiment and panic'd the system.

So, as Optane DCPMEM is generally for server/cloud deployment, and as
RAS is a priority for server over administrative commands, to allow
namespace management command to panic kernel is not an option?

Here is my stress experiment -
   
Start out with ./create_nm.sh to create as many 48G devdax namespaces
as possible. Once that's completed, firing up 6 actions in quick
successions in below order:
   -> ndctl destroy-namespace all -f
   -> ./create_nm.sh
   -> ndctl destroy-namespace all -f
   -> ./create_nm.sh
   -> ndctl destroy-namespace all -f
   -> ./create_nm.sh

==========  console message =======
Kernel 5.1.0-rc7-next-20190501-libnvdimm-pending on an x86_64

ban25uut130 login: [ 1620.866813] BUG: kernel NULL pointer dereference, address: 0000000000000020
[ 1620.874585] #PF: supervisor read access in kernel mode
[ 1620.880319] #PF: error_code(0x0000) - not-present page
[ 1620.886052] PGD 0 P4D 0
[ 1620.888879] Oops: 0000 [#1] SMP NOPTI
[ 1620.892964] CPU: 19 PID: 5611 Comm: kworker/u130:3 Tainted: G        W         5.1.0-rc7-next-20190501-libnvdimm-pending #5
[ 1620.905389] Hardware name: Oracle Corporation ORACLE SERVER X8-2L/ASM,MTHRBD,2U, BIOS 52020101 05/07/2019
[ 1620.916069] Workqueue: events_unbound async_run_entry_fn
[ 1620.921997] RIP: 0010:klist_put+0x1b/0x6c
[ 1620.926471] Code: 48 8b 43 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 55 48 89 e5 41 56 41 89 f6 41 55 41 54 53 4c 8b 27 48 89 fb 49 83 e4 fe 4c 89 e7 <4d> 8b 6c 24 20 e8 3a d4 01 00 45 84 f6 74 10 48 8b 03 a8 01 74 02
[ 1620.947427] RSP: 0018:ffffb1a5e6727da0 EFLAGS: 00010246
[ 1620.953258] RAX: ffff956796604c00 RBX: ffff956796604c28 RCX: 0000000000000000
[ 1620.961223] RDX: ffff955000c2c4d8 RSI: 0000000000000001 RDI: 0000000000000000
[ 1620.969185] RBP: ffffb1a5e6727dc0 R08: 0000000000000002 R09: ffffffffbb54b3c0
[ 1620.977150] R10: ffffb1a5e6727d40 R11: fefefefefefefeff R12: 0000000000000000
[ 1620.985116] R13: ffff94d18dcfd000 R14: 0000000000000001 R15: ffff955000caf140
[ 1620.993081] FS:  0000000000000000(0000) GS:ffff95679f4c0000(0000) knlGS:0000000000000000
[ 1621.002113] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1621.008524] CR2: 0000000000000020 CR3: 0000009fa100a005 CR4: 00000000007606e0
[ 1621.016487] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1621.024450] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1621.032413] PKRU: 55555554
[ 1621.035433] Call Trace:
[ 1621.038161]  klist_del+0xe/0x10
[ 1621.041667]  device_del+0x8a/0x2c9
[ 1621.045463]  ? __switch_to_asm+0x34/0x70
[ 1621.049840]  ? __switch_to_asm+0x40/0x70
[ 1621.054220]  device_unregister+0x44/0x4f
[ 1621.058603]  nd_async_device_unregister+0x22/0x2d [libnvdimm]
[ 1621.065016]  async_run_entry_fn+0x47/0x15a
[ 1621.069588]  process_one_work+0x1a2/0x2eb
[ 1621.074064]  worker_thread+0x1b8/0x26e
[ 1621.078239]  ? cancel_delayed_work_sync+0x15/0x15
[ 1621.083490]  kthread+0xf8/0xfd
[ 1621.086897]  ? kthread_destroy_worker+0x45/0x45
[ 1621.091954]  ret_from_fork+0x1f/0x40
[ 1621.095944] Modules linked in: xt_REDIRECT xt_nat xt_CHECKSUM iptable_mangle xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter iptable_filter scsi_transport_iscsi ip6table_nat ip6_tables iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 vfat fat skx_edac intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt iTCO_vendor_support aesni_intel ipmi_si crypto_simd cryptd glue_helper ipmi_devintf ipmi_msghandler sg pcspkr dax_pmem_compat device_dax dax_pmem_core i2c_i801 pcc_cpufreq lpc_ich ioatdma wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c nd_pmem nd_btt sr_mod cdrom sd_mod mgag200 drm_kms_helper syscopyarea crc32c_intel sysfillrect sysimgblt fb_sys_fops ttm megaraid_sas drm igb ahci libahci ptp libata pps_core dca i2c_algo_bit nfit libnvdimm uas usb_storage dm_mirror dm_region_hash dm_log dm_mod
[ 1621.189449] CR2: 0000000000000020
[ 1621.193169] ---[ end trace 7c3f7029ef24aa5a ]---
[ 1621.305383] RIP: 0010:klist_put+0x1b/0x6c
[ 1621.309860] Code: 48 8b 43 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 55 48 89 e5 41 56 41 89 f6 41 55 41 54 53 4c 8b 27 48 89 fb 49 83 e4 fe 4c 89 e7 <4d> 8b 6c 24 20 e8 3a d4 01 00 45 84 f6 74 10 48 8b 03 a8 01 74 02
[ 1621.330809] RSP: 0018:ffffb1a5e6727da0 EFLAGS: 00010246
[ 1621.336642] RAX: ffff956796604c00 RBX: ffff956796604c28 RCX: 0000000000000000
[ 1621.344606] RDX: ffff955000c2c4d8 RSI: 0000000000000001 RDI: 0000000000000000
[ 1621.352570] RBP: ffffb1a5e6727dc0 R08: 0000000000000002 R09: ffffffffbb54b3c0
[ 1621.360533] R10: ffffb1a5e6727d40 R11: fefefefefefefeff R12: 0000000000000000
[ 1621.368496] R13: ffff94d18dcfd000 R14: 0000000000000001 R15: ffff955000caf140
[ 1621.376460] FS:  0000000000000000(0000) GS:ffff95679f4c0000(0000) knlGS:0000000000000000
[ 1621.385490] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1621.391902] CR2: 0000000000000020 CR3: 0000009fa100a005 CR4: 00000000007606e0
[ 1621.399867] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1621.407830] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1621.415793] PKRU: 55555554
[ 1621.418814] Kernel panic - not syncing: Fatal exception
[ 1621.424740] Kernel Offset: 0x39000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 1621.550711] ---[ end Kernel panic - not syncing: Fatal exception ]---


Thanks!
-jane
Jane Chu May 16, 2019, 4:45 p.m. UTC | #6
Hi,

I'm able to reproduce the panic below by running two sets of ndctl
commands that actually serve legitimate purpose in parallel (unlike
the brute force experiment earlier), each set in a indefinite loop.
This time it takes about an hour to panic.  But I gather the cause
is probably the same: I've overlapped ndctl commands on the same
region.

Could we add a check in nd_ioctl(), such that if there is
an ongoing ndctl command on a region, subsequent ndctl request
will fail immediately with something to the effect of EAGAIN?
The rationale being that kernel should protect itself against
user mistakes.

Also, sensing the subject fix is for a different problem, and has been
verified, I'm happy to see it in upstream, so we have a better
code base to digger deeper in terms of how the destructive ndctl
commands interacts to typical mission critical applications, include
but not limited to rdma.

thanks,
-jane

On 5/14/2019 2:18 PM, Jane Chu wrote:
> On 5/14/2019 12:04 PM, Dan Williams wrote:
>
>> On Tue, May 14, 2019 at 11:53 AM Jane Chu <jane.chu@oracle.com> wrote:
>>> On 5/13/2019 12:22 PM, Logan Gunthorpe wrote:
>>>
>>> On 2019-05-08 11:05 a.m., Logan Gunthorpe wrote:
>>>
>>> On 2019-05-07 5:55 p.m., Dan Williams wrote:
>>>
>>> Changes since v1 [1]:
>>> - Fix a NULL-pointer deref crash in pci_p2pdma_release() (Logan)
>>>
>>> - Refresh the p2pdma patch headers to match the format of other p2pdma
>>>     patches (Bjorn)
>>>
>>> - Collect Ira's reviewed-by
>>>
>>> [1]: 
>>> https://lore.kernel.org/lkml/155387324370.2443841.574715745262628837.stgit@dwillia2-desk3.amr.corp.intel.com/
>>>
>>> This series looks good to me:
>>>
>>> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
>>>
>>> However, I haven't tested it yet but I intend to later this week.
>>>
>>> I've tested libnvdimm-pending which includes this series on my setup 
>>> and
>>> everything works great.
>>>
>>> Just wondering in a difference scenario where pmem pages are 
>>> exported to
>>> a KVM guest, and then by mistake the user issues "ndctl 
>>> destroy-namespace -f",
>>> will the kernel wait indefinitely until the user figures out to kill 
>>> the guest
>>> and release the pmem pages?
>> It depends on whether the pages are pinned. Typically DAX memory
>> mappings assigned to a guest are not pinned in the host and can be
>> invalidated at any time. The pinning only occurs with VFIO and
>> device-assignment which isn't the common case, especially since that
>> configuration is blocked by fsdax. However, with devdax, yes you can
>> arrange for the system to go into an indefinite wait.
>>
>> This somewhat ties back to the get_user_pages() vs DAX debate. The
>> indefinite stall issue with device-assignment could be addressed with
>> a requirement to hold a lease and expect that a lease revocation event
>> may escalate to SIGKILL in response to 'ndctl destroy-namespace'. The
>> expectation with device-dax is that it is already a raw interface with
>> pointy edges and caveats, but I would not be opposed to introducing a
>> lease semantic.
>
> Thanks for the quick response Dan.
>
> I am not convinced that the get_user_pages() vs FS-DAX dilemma is a 
> perfect
> comparison to "ndctl destroy-namespace -f" vs namespace-is-busy dilemma.
>
> Others might disagree with me, I thought that there is no risk of panic
> if we fail "ndctl destroy-namespace -f" to honor a clean shutdown of the
> user application. Also, both actions are on the same host, so in theory
> the admin could shutdown the application before attempt a destructive
> action.
>
> By allowing 'opposite' actions in competition with each other at fine
> granularity, there is potential for panic in general, not necessarily 
> with
> pinned page I guess.  I just ran an experiment and panic'd the system.
>
> So, as Optane DCPMEM is generally for server/cloud deployment, and as
> RAS is a priority for server over administrative commands, to allow
> namespace management command to panic kernel is not an option?
>
> Here is my stress experiment -
>   Start out with ./create_nm.sh to create as many 48G devdax namespaces
> as possible. Once that's completed, firing up 6 actions in quick
> successions in below order:
>   -> ndctl destroy-namespace all -f
>   -> ./create_nm.sh
>   -> ndctl destroy-namespace all -f
>   -> ./create_nm.sh
>   -> ndctl destroy-namespace all -f
>   -> ./create_nm.sh
>
> ==========  console message =======
> Kernel 5.1.0-rc7-next-20190501-libnvdimm-pending on an x86_64
>
> ban25uut130 login: [ 1620.866813] BUG: kernel NULL pointer 
> dereference, address: 0000000000000020
> [ 1620.874585] #PF: supervisor read access in kernel mode
> [ 1620.880319] #PF: error_code(0x0000) - not-present page
> [ 1620.886052] PGD 0 P4D 0
> [ 1620.888879] Oops: 0000 [#1] SMP NOPTI
> [ 1620.892964] CPU: 19 PID: 5611 Comm: kworker/u130:3 Tainted: 
> G        W         5.1.0-rc7-next-20190501-libnvdimm-pending #5
> [ 1620.905389] Hardware name: Oracle Corporation ORACLE SERVER 
> X8-2L/ASM,MTHRBD,2U, BIOS 52020101 05/07/2019
> [ 1620.916069] Workqueue: events_unbound async_run_entry_fn
> [ 1620.921997] RIP: 0010:klist_put+0x1b/0x6c
> [ 1620.926471] Code: 48 8b 43 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 55 
> 48 89 e5 41 56 41 89 f6 41 55 41 54 53 4c 8b 27 48 89 fb 49 83 e4 fe 
> 4c 89 e7 <4d> 8b 6c 24 20 e8 3a d4 01 00 45 84 f6 74 10 48 8b 03 a8 01 
> 74 02
> [ 1620.947427] RSP: 0018:ffffb1a5e6727da0 EFLAGS: 00010246
> [ 1620.953258] RAX: ffff956796604c00 RBX: ffff956796604c28 RCX: 
> 0000000000000000
> [ 1620.961223] RDX: ffff955000c2c4d8 RSI: 0000000000000001 RDI: 
> 0000000000000000
> [ 1620.969185] RBP: ffffb1a5e6727dc0 R08: 0000000000000002 R09: 
> ffffffffbb54b3c0
> [ 1620.977150] R10: ffffb1a5e6727d40 R11: fefefefefefefeff R12: 
> 0000000000000000
> [ 1620.985116] R13: ffff94d18dcfd000 R14: 0000000000000001 R15: 
> ffff955000caf140
> [ 1620.993081] FS:  0000000000000000(0000) GS:ffff95679f4c0000(0000) 
> knlGS:0000000000000000
> [ 1621.002113] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1621.008524] CR2: 0000000000000020 CR3: 0000009fa100a005 CR4: 
> 00000000007606e0
> [ 1621.016487] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
> 0000000000000000
> [ 1621.024450] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
> 0000000000000400
> [ 1621.032413] PKRU: 55555554
> [ 1621.035433] Call Trace:
> [ 1621.038161]  klist_del+0xe/0x10
> [ 1621.041667]  device_del+0x8a/0x2c9
> [ 1621.045463]  ? __switch_to_asm+0x34/0x70
> [ 1621.049840]  ? __switch_to_asm+0x40/0x70
> [ 1621.054220]  device_unregister+0x44/0x4f
> [ 1621.058603]  nd_async_device_unregister+0x22/0x2d [libnvdimm]
> [ 1621.065016]  async_run_entry_fn+0x47/0x15a
> [ 1621.069588]  process_one_work+0x1a2/0x2eb
> [ 1621.074064]  worker_thread+0x1b8/0x26e
> [ 1621.078239]  ? cancel_delayed_work_sync+0x15/0x15
> [ 1621.083490]  kthread+0xf8/0xfd
> [ 1621.086897]  ? kthread_destroy_worker+0x45/0x45
> [ 1621.091954]  ret_from_fork+0x1f/0x40
> [ 1621.095944] Modules linked in: xt_REDIRECT xt_nat xt_CHECKSUM 
> iptable_mangle xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 
> tun bridge stp llc ebtable_filter ebtables ip6table_filter 
> iptable_filter scsi_transport_iscsi ip6table_nat ip6_tables 
> iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 vfat fat 
> skx_edac intel_powerclamp coretemp kvm_intel kvm irqbypass 
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt 
> iTCO_vendor_support aesni_intel ipmi_si crypto_simd cryptd glue_helper 
> ipmi_devintf ipmi_msghandler sg pcspkr dax_pmem_compat device_dax 
> dax_pmem_core i2c_i801 pcc_cpufreq lpc_ich ioatdma wmi nfsd 
> auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c nd_pmem 
> nd_btt sr_mod cdrom sd_mod mgag200 drm_kms_helper syscopyarea 
> crc32c_intel sysfillrect sysimgblt fb_sys_fops ttm megaraid_sas drm 
> igb ahci libahci ptp libata pps_core dca i2c_algo_bit nfit libnvdimm 
> uas usb_storage dm_mirror dm_region_hash dm_log dm_mod
> [ 1621.189449] CR2: 0000000000000020
> [ 1621.193169] ---[ end trace 7c3f7029ef24aa5a ]---
> [ 1621.305383] RIP: 0010:klist_put+0x1b/0x6c
> [ 1621.309860] Code: 48 8b 43 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 55 
> 48 89 e5 41 56 41 89 f6 41 55 41 54 53 4c 8b 27 48 89 fb 49 83 e4 fe 
> 4c 89 e7 <4d> 8b 6c 24 20 e8 3a d4 01 00 45 84 f6 74 10 48 8b 03 a8 01 
> 74 02
> [ 1621.330809] RSP: 0018:ffffb1a5e6727da0 EFLAGS: 00010246
> [ 1621.336642] RAX: ffff956796604c00 RBX: ffff956796604c28 RCX: 
> 0000000000000000
> [ 1621.344606] RDX: ffff955000c2c4d8 RSI: 0000000000000001 RDI: 
> 0000000000000000
> [ 1621.352570] RBP: ffffb1a5e6727dc0 R08: 0000000000000002 R09: 
> ffffffffbb54b3c0
> [ 1621.360533] R10: ffffb1a5e6727d40 R11: fefefefefefefeff R12: 
> 0000000000000000
> [ 1621.368496] R13: ffff94d18dcfd000 R14: 0000000000000001 R15: 
> ffff955000caf140
> [ 1621.376460] FS:  0000000000000000(0000) GS:ffff95679f4c0000(0000) 
> knlGS:0000000000000000
> [ 1621.385490] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1621.391902] CR2: 0000000000000020 CR3: 0000009fa100a005 CR4: 
> 00000000007606e0
> [ 1621.399867] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
> 0000000000000000
> [ 1621.407830] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
> 0000000000000400
> [ 1621.415793] PKRU: 55555554
> [ 1621.418814] Kernel panic - not syncing: Fatal exception
> [ 1621.424740] Kernel Offset: 0x39000000 from 0xffffffff81000000 
> (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 1621.550711] ---[ end Kernel panic - not syncing: Fatal exception ]---
>
>
> Thanks!
> -jane
>
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
Jane Chu May 16, 2019, 9:42 p.m. UTC | #7
Apology for resending in plain text.
-jane

On 5/16/19 9:45 AM, Jane Chu wrote:
> Hi,
> 
> I'm able to reproduce the panic below by running two sets of ndctl
> commands that actually serve legitimate purpose in parallel (unlike
> the brute force experiment earlier), each set in a indefinite loop.
> This time it takes about an hour to panic.  But I gather the cause
> is probably the same: I've overlapped ndctl commands on the same
> region.
> 
> Could we add a check in nd_ioctl(), such that if there is
> an ongoing ndctl command on a region, subsequent ndctl request
> will fail immediately with something to the effect of EAGAIN?
> The rationale being that kernel should protect itself against
> user mistakes.
> 
> Also, sensing the subject fix is for a different problem, and has been
> verified, I'm happy to see it in upstream, so we have a better
> code base to digger deeper in terms of how the destructive ndctl
> commands interacts to typical mission critical applications, include
> but not limited to rdma.
> 
> thanks,
> -jane
> 
> On 5/14/2019 2:18 PM, Jane Chu wrote:
>> On 5/14/2019 12:04 PM, Dan Williams wrote:
>>
>>> On Tue, May 14, 2019 at 11:53 AM Jane Chu <jane.chu@oracle.com> wrote:
>>>> On 5/13/2019 12:22 PM, Logan Gunthorpe wrote:
>>>>
>>>> On 2019-05-08 11:05 a.m., Logan Gunthorpe wrote:
>>>>
>>>> On 2019-05-07 5:55 p.m., Dan Williams wrote:
>>>>
>>>> Changes since v1 [1]:
>>>> - Fix a NULL-pointer deref crash in pci_p2pdma_release() (Logan)
>>>>
>>>> - Refresh the p2pdma patch headers to match the format of other p2pdma
>>>>     patches (Bjorn)
>>>>
>>>> - Collect Ira's reviewed-by
>>>>
>>>> [1]: 
>>>> https://lore.kernel.org/lkml/155387324370.2443841.574715745262628837.stgit@dwillia2-desk3.amr.corp.intel.com/ 
>>>>
>>>>
>>>> This series looks good to me:
>>>>
>>>> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
>>>>
>>>> However, I haven't tested it yet but I intend to later this week.
>>>>
>>>> I've tested libnvdimm-pending which includes this series on my setup 
>>>> and
>>>> everything works great.
>>>>
>>>> Just wondering in a difference scenario where pmem pages are 
>>>> exported to
>>>> a KVM guest, and then by mistake the user issues "ndctl 
>>>> destroy-namespace -f",
>>>> will the kernel wait indefinitely until the user figures out to kill 
>>>> the guest
>>>> and release the pmem pages?
>>> It depends on whether the pages are pinned. Typically DAX memory
>>> mappings assigned to a guest are not pinned in the host and can be
>>> invalidated at any time. The pinning only occurs with VFIO and
>>> device-assignment which isn't the common case, especially since that
>>> configuration is blocked by fsdax. However, with devdax, yes you can
>>> arrange for the system to go into an indefinite wait.
>>>
>>> This somewhat ties back to the get_user_pages() vs DAX debate. The
>>> indefinite stall issue with device-assignment could be addressed with
>>> a requirement to hold a lease and expect that a lease revocation event
>>> may escalate to SIGKILL in response to 'ndctl destroy-namespace'. The
>>> expectation with device-dax is that it is already a raw interface with
>>> pointy edges and caveats, but I would not be opposed to introducing a
>>> lease semantic.
>>
>> Thanks for the quick response Dan.
>>
>> I am not convinced that the get_user_pages() vs FS-DAX dilemma is a 
>> perfect
>> comparison to "ndctl destroy-namespace -f" vs namespace-is-busy dilemma.
>>
>> Others might disagree with me, I thought that there is no risk of panic
>> if we fail "ndctl destroy-namespace -f" to honor a clean shutdown of the
>> user application. Also, both actions are on the same host, so in theory
>> the admin could shutdown the application before attempt a destructive
>> action.
>>
>> By allowing 'opposite' actions in competition with each other at fine
>> granularity, there is potential for panic in general, not necessarily 
>> with
>> pinned page I guess.  I just ran an experiment and panic'd the system.
>>
>> So, as Optane DCPMEM is generally for server/cloud deployment, and as
>> RAS is a priority for server over administrative commands, to allow
>> namespace management command to panic kernel is not an option?
>>
>> Here is my stress experiment -
>>   Start out with ./create_nm.sh to create as many 48G devdax namespaces
>> as possible. Once that's completed, firing up 6 actions in quick
>> successions in below order:
>>   -> ndctl destroy-namespace all -f
>>   -> ./create_nm.sh
>>   -> ndctl destroy-namespace all -f
>>   -> ./create_nm.sh
>>   -> ndctl destroy-namespace all -f
>>   -> ./create_nm.sh
>>
>> ==========  console message =======
>> Kernel 5.1.0-rc7-next-20190501-libnvdimm-pending on an x86_64
>>
>> ban25uut130 login: [ 1620.866813] BUG: kernel NULL pointer 
>> dereference, address: 0000000000000020
>> [ 1620.874585] #PF: supervisor read access in kernel mode
>> [ 1620.880319] #PF: error_code(0x0000) - not-present page
>> [ 1620.886052] PGD 0 P4D 0
>> [ 1620.888879] Oops: 0000 [#1] SMP NOPTI
>> [ 1620.892964] CPU: 19 PID: 5611 Comm: kworker/u130:3 Tainted: 
>> G        W         5.1.0-rc7-next-20190501-libnvdimm-pending #5
>> [ 1620.905389] Hardware name: Oracle Corporation ORACLE SERVER 
>> X8-2L/ASM,MTHRBD,2U, BIOS 52020101 05/07/2019
>> [ 1620.916069] Workqueue: events_unbound async_run_entry_fn
>> [ 1620.921997] RIP: 0010:klist_put+0x1b/0x6c
>> [ 1620.926471] Code: 48 8b 43 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 55 
>> 48 89 e5 41 56 41 89 f6 41 55 41 54 53 4c 8b 27 48 89 fb 49 83 e4 fe 
>> 4c 89 e7 <4d> 8b 6c 24 20 e8 3a d4 01 00 45 84 f6 74 10 48 8b 03 a8 01 
>> 74 02
>> [ 1620.947427] RSP: 0018:ffffb1a5e6727da0 EFLAGS: 00010246
>> [ 1620.953258] RAX: ffff956796604c00 RBX: ffff956796604c28 RCX: 
>> 0000000000000000
>> [ 1620.961223] RDX: ffff955000c2c4d8 RSI: 0000000000000001 RDI: 
>> 0000000000000000
>> [ 1620.969185] RBP: ffffb1a5e6727dc0 R08: 0000000000000002 R09: 
>> ffffffffbb54b3c0
>> [ 1620.977150] R10: ffffb1a5e6727d40 R11: fefefefefefefeff R12: 
>> 0000000000000000
>> [ 1620.985116] R13: ffff94d18dcfd000 R14: 0000000000000001 R15: 
>> ffff955000caf140
>> [ 1620.993081] FS:  0000000000000000(0000) GS:ffff95679f4c0000(0000) 
>> knlGS:0000000000000000
>> [ 1621.002113] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1621.008524] CR2: 0000000000000020 CR3: 0000009fa100a005 CR4: 
>> 00000000007606e0
>> [ 1621.016487] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
>> 0000000000000000
>> [ 1621.024450] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
>> 0000000000000400
>> [ 1621.032413] PKRU: 55555554
>> [ 1621.035433] Call Trace:
>> [ 1621.038161]  klist_del+0xe/0x10
>> [ 1621.041667]  device_del+0x8a/0x2c9
>> [ 1621.045463]  ? __switch_to_asm+0x34/0x70
>> [ 1621.049840]  ? __switch_to_asm+0x40/0x70
>> [ 1621.054220]  device_unregister+0x44/0x4f
>> [ 1621.058603]  nd_async_device_unregister+0x22/0x2d [libnvdimm]
>> [ 1621.065016]  async_run_entry_fn+0x47/0x15a
>> [ 1621.069588]  process_one_work+0x1a2/0x2eb
>> [ 1621.074064]  worker_thread+0x1b8/0x26e
>> [ 1621.078239]  ? cancel_delayed_work_sync+0x15/0x15
>> [ 1621.083490]  kthread+0xf8/0xfd
>> [ 1621.086897]  ? kthread_destroy_worker+0x45/0x45
>> [ 1621.091954]  ret_from_fork+0x1f/0x40
>> [ 1621.095944] Modules linked in: xt_REDIRECT xt_nat xt_CHECKSUM 
>> iptable_mangle xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 
>> tun bridge stp llc ebtable_filter ebtables ip6table_filter 
>> iptable_filter scsi_transport_iscsi ip6table_nat ip6_tables 
>> iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 vfat fat 
>> skx_edac intel_powerclamp coretemp kvm_intel kvm irqbypass 
>> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt 
>> iTCO_vendor_support aesni_intel ipmi_si crypto_simd cryptd glue_helper 
>> ipmi_devintf ipmi_msghandler sg pcspkr dax_pmem_compat device_dax 
>> dax_pmem_core i2c_i801 pcc_cpufreq lpc_ich ioatdma wmi nfsd 
>> auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c nd_pmem 
>> nd_btt sr_mod cdrom sd_mod mgag200 drm_kms_helper syscopyarea 
>> crc32c_intel sysfillrect sysimgblt fb_sys_fops ttm megaraid_sas drm 
>> igb ahci libahci ptp libata pps_core dca i2c_algo_bit nfit libnvdimm 
>> uas usb_storage dm_mirror dm_region_hash dm_log dm_mod
>> [ 1621.189449] CR2: 0000000000000020
>> [ 1621.193169] ---[ end trace 7c3f7029ef24aa5a ]---
>> [ 1621.305383] RIP: 0010:klist_put+0x1b/0x6c
>> [ 1621.309860] Code: 48 8b 43 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 55 
>> 48 89 e5 41 56 41 89 f6 41 55 41 54 53 4c 8b 27 48 89 fb 49 83 e4 fe 
>> 4c 89 e7 <4d> 8b 6c 24 20 e8 3a d4 01 00 45 84 f6 74 10 48 8b 03 a8 01 
>> 74 02
>> [ 1621.330809] RSP: 0018:ffffb1a5e6727da0 EFLAGS: 00010246
>> [ 1621.336642] RAX: ffff956796604c00 RBX: ffff956796604c28 RCX: 
>> 0000000000000000
>> [ 1621.344606] RDX: ffff955000c2c4d8 RSI: 0000000000000001 RDI: 
>> 0000000000000000
>> [ 1621.352570] RBP: ffffb1a5e6727dc0 R08: 0000000000000002 R09: 
>> ffffffffbb54b3c0
>> [ 1621.360533] R10: ffffb1a5e6727d40 R11: fefefefefefefeff R12: 
>> 0000000000000000
>> [ 1621.368496] R13: ffff94d18dcfd000 R14: 0000000000000001 R15: 
>> ffff955000caf140
>> [ 1621.376460] FS:  0000000000000000(0000) GS:ffff95679f4c0000(0000) 
>> knlGS:0000000000000000
>> [ 1621.385490] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1621.391902] CR2: 0000000000000020 CR3: 0000009fa100a005 CR4: 
>> 00000000007606e0
>> [ 1621.399867] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
>> 0000000000000000
>> [ 1621.407830] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
>> 0000000000000400
>> [ 1621.415793] PKRU: 55555554
>> [ 1621.418814] Kernel panic - not syncing: Fatal exception
>> [ 1621.424740] Kernel Offset: 0x39000000 from 0xffffffff81000000 
>> (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>> [ 1621.550711] ---[ end Kernel panic - not syncing: Fatal exception ]---
>>
>>
>> Thanks!
>> -jane
>>
>> _______________________________________________
>> Linux-nvdimm mailing list
>> Linux-nvdimm@lists.01.org
>> https://lists.01.org/mailman/listinfo/linux-nvdimm
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
Dan Williams May 16, 2019, 9:51 p.m. UTC | #8
On Thu, May 16, 2019 at 9:45 AM Jane Chu <jane.chu@oracle.com> wrote:
>
> Hi,
>
> I'm able to reproduce the panic below by running two sets of ndctl
> commands that actually serve legitimate purpose in parallel (unlike
> the brute force experiment earlier), each set in a indefinite loop.
> This time it takes about an hour to panic.  But I gather the cause
> is probably the same: I've overlapped ndctl commands on the same
> region.
>
> Could we add a check in nd_ioctl(), such that if there is
> an ongoing ndctl command on a region, subsequent ndctl request
> will fail immediately with something to the effect of EAGAIN?
> The rationale being that kernel should protect itself against
> user mistakes.

We do already have locking in the driver to prevent configuration
collisions. The problem looks to be broken assumptions about running
the device unregistration path in a separate thread outside the lock.
I suspect it may be incorrect assumptions about the userspace
visibility of the device relative to teardown actions. To be clear
this isn't the nd_ioctl() path this is the sysfs path.


> Also, sensing the subject fix is for a different problem, and has been
> verified, I'm happy to see it in upstream, so we have a better
> code base to digger deeper in terms of how the destructive ndctl
> commands interacts to typical mission critical applications, include
> but not limited to rdma.

Right, the crash signature you are seeing looks unrelated to the issue
being address in these patches which is device-teardown racing active
page pins. I'll start the investigation on the crash signature, but
again I don't think it reads on this fix series.
Jane Chu May 17, 2019, 12:01 a.m. UTC | #9
On 5/16/2019 2:51 PM, Dan Williams wrote:

> On Thu, May 16, 2019 at 9:45 AM Jane Chu <jane.chu@oracle.com> wrote:
>> Hi,
>>
>> I'm able to reproduce the panic below by running two sets of ndctl
>> commands that actually serve legitimate purpose in parallel (unlike
>> the brute force experiment earlier), each set in a indefinite loop.
>> This time it takes about an hour to panic.  But I gather the cause
>> is probably the same: I've overlapped ndctl commands on the same
>> region.
>>
>> Could we add a check in nd_ioctl(), such that if there is
>> an ongoing ndctl command on a region, subsequent ndctl request
>> will fail immediately with something to the effect of EAGAIN?
>> The rationale being that kernel should protect itself against
>> user mistakes.
> We do already have locking in the driver to prevent configuration
> collisions. The problem looks to be broken assumptions about running
> the device unregistration path in a separate thread outside the lock.
> I suspect it may be incorrect assumptions about the userspace
> visibility of the device relative to teardown actions. To be clear
> this isn't the nd_ioctl() path this is the sysfs path.

I see, thanks!

>
>> Also, sensing the subject fix is for a different problem, and has been
>> verified, I'm happy to see it in upstream, so we have a better
>> code base to digger deeper in terms of how the destructive ndctl
>> commands interacts to typical mission critical applications, include
>> but not limited to rdma.
> Right, the crash signature you are seeing looks unrelated to the issue
> being address in these patches which is device-teardown racing active
> page pins. I'll start the investigation on the crash signature, but
> again I don't think it reads on this fix series.

Agreed on investigating the crash as separate issue, looking forward
to see this patchset in upstream.

Thanks!
-jane
Dan Williams May 31, 2019, 4:17 a.m. UTC | #10
On Mon, May 13, 2019 at 12:22 PM Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
>
> On 2019-05-08 11:05 a.m., Logan Gunthorpe wrote:
> >
> >
> > On 2019-05-07 5:55 p.m., Dan Williams wrote:
> >> Changes since v1 [1]:
> >> - Fix a NULL-pointer deref crash in pci_p2pdma_release() (Logan)
> >>
> >> - Refresh the p2pdma patch headers to match the format of other p2pdma
> >>    patches (Bjorn)
> >>
> >> - Collect Ira's reviewed-by
> >>
> >> [1]: https://lore.kernel.org/lkml/155387324370.2443841.574715745262628837.stgit@dwillia2-desk3.amr.corp.intel.com/
> >
> > This series looks good to me:
> >
> > Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
> >
> > However, I haven't tested it yet but I intend to later this week.
>
> I've tested libnvdimm-pending which includes this series on my setup and
> everything works great.

Hi Andrew,

With this tested-by can we move forward on this fix set? I'm not aware
of any other remaining comments. Greg had a question about
"drivers/base/devres: Introduce devm_release_action()" that I
answered, but otherwise the feedback has gone silent.