Message ID | CAPcyv4i7LURhWQvDtxu6Bdtr2sEEHqjPqaE6a7oBYX7raNqyCw@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
On 04/14/2017 11:50 AM, Dan Williams wrote:
>
> I have not been able to reproduce this panic, but does the following
> make a difference on your systems?
>
> diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
> index bc02f28ed8b8..afb7a7efc12a 100644
> --- a/tools/testing/nvdimm/test/nfit.c
> +++ b/tools/testing/nvdimm/test/nfit.c
> @@ -1983,9 +1983,9 @@ static __exit void nfit_test_exit(void)
>  {
>  	int i;
>
> -	platform_driver_unregister(&nfit_test_driver);
>  	for (i = 0; i < NUM_NFITS; i++)
>  		platform_device_unregister(&instances[i]->pdev);
> +	platform_driver_unregister(&nfit_test_driver);
>  	nfit_test_teardown();
>  	class_destroy(nfit_test_dimm);
>  }

I tried it on both of my systems. My 72-core server has NVDIMMs and my
4-core laptop does not. Both systems still panic but now they panic
before trying to rmmod the module, which I had commented out of my
script. I tried it several times and got panics in different modules
(xfs, selinux, etc) but all seem to be related to kmem. See below for
an example.

It feels like corruption that has just moved a bit. If you recall from
my original mail, I reported two kinds of panics and this looks just
like the second panic.

Since it never happens for you and always happens for me, it feels like
a build/procedure problem. I haven't had a chance to try in a VM. Have
you had a chance to try on bare metal?

-- ljk

This is what I ran:

$ sudo ./test.sh
+ modprobe nfit
+ modprobe dax
+ modprobe dax_pmem
+ modprobe libnvdimm
+ modprobe nd_blk
+ modprobe nd_btt
+ modprobe nd_e820
+ modprobe nd_pmem
+ lsmod
+ grep nfit
nfit              49152  8
libnvdimm        135168  6 nd_btt,nd_pmem,nd_e820,nd_blk,dax_pmem,nfit
nfit_test_iomap   16384  4 nd_pmem,dax_pmem,nfit,libnvdimm
+ modprobe nfit_test
+ lsmod
+ grep nfit
nfit_test         28672  6
nfit              49152  9 nfit_test
libnvdimm        135168  7 nfit_test,nd_btt,nd_pmem,nd_e820,nd_blk,dax_pmem,nfit
nfit_test_iomap   16384  5 nfit_test,nd_pmem,dax_pmem,nfit,libnvdimm
+ ndctl disable-region all

This is what I got:

[ 70.247427] nfit_test nfit_test.0: failed to evaluate _FIT
[ 71.340634] BUG: unable to handle kernel paging request at ffffeb04002e00a0
[ 71.341359] nd btt9.0: nd_btt_release
[ 71.341396] nd_bus ndbus1: nd_region.remove(region9) = 0
[ 71.341399] nd_bus ndbus1: nvdimm_map_release: 0xffffc90006d00000
[ 71.341401] nd_bus ndbus1: nvdimm_map_release: 0xffffc90026003000
[ 71.342597] nd btt11.0: nd_btt_release
[ 71.342599] nd_bus ndbus1: nd_region.remove(region11) = 0
[ 71.342601] nd_bus ndbus1: nvdimm_map_release: 0xffffc90006d5e000
[ 71.342603] nd_bus ndbus1: nvdimm_map_release: 0xffffc9002a005000
[ 71.343797] nd pfn13.0: nd_pfn_release
[ 71.343837] nd btt13.0: nd_btt_release
[ 71.343848] nd dax13.0: nd_dax_release
[ 71.343850] nd_bus ndbus1: nd_region.remove(region13) = 0
[ 71.343852] nd_bus ndbus1: nvdimm_map_release: 0xffffc90003841000
[ 71.343853] nd_bus ndbus1: nvdimm_map_release: 0xffffc90003819000
[ 71.345058] nd btt8.0: nd_btt_release
[ 71.345090] nd_bus ndbus1: nd_region.remove(region8) = 0
[ 71.345092] nd_bus ndbus1: nvdimm_map_release: 0xffffc90006b91000
[ 71.345093] nd_bus ndbus1: nvdimm_map_release: 0xffffc90024002000
[ 71.345095] nd_bus ndbus1: nvdimm_map_release: 0xffffc90003811000
[ 71.346280] nd btt10.0: nd_btt_release
[ 71.346311] nd_bus ndbus1: nd_region.remove(region10) = 0
[ 71.346313] nd_bus ndbus1: nvdimm_map_release: 0xffffc90006d3d000
[ 71.346314] nd_bus ndbus1: nvdimm_map_release: 0xffffc90028004000
[ 71.346316] nd_bus ndbus1: nvdimm_map_release: 0xffffc90003821000
[ 71.348129] nd btt1.0: nd_btt_release
[ 71.348161] nd pfn1.0: nd_pfn_release
[ 71.348187] nd dax1.0: nd_dax_release
[ 71.348217] nd_bus ndbus0: nd_pmem.remove(pfn1.1) = 0
[ 72.024210] IP: kmem_cache_free+0x5a/0x1f0
[ 72.042828] PGD 0
[ 72.042829]
[ 72.058492] Oops: 0000 [#1] SMP
[ 72.072584] Modules linked in: nfit_test(O) nd_e820(O) nd_blk(O) ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp vfat fat coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc ipmi_ssif aesni_intel crypto_simd glue_helper cryptd nd_pmem(O) nd_btt(O) dax_pmem(O) iTCO_wdt dax(O) sg i2c_i801 wmi hpilo hpwdt ioatdma iTCO_vendor_support
[ 72.402497]  ipmi_si shpchp dca pcspkr ipmi_devintf lpc_ich nfit(O) libnvdimm(O) ipmi_msghandler acpi_power_meter nfit_test_iomap(O) ip_tables xfs sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm bnx2x tg3 mdio ptp hpsa i2c_core pps_core libcrc32c scsi_transport_sas crc32c_intel
[ 72.533082] CPU: 3 PID: 2112 Comm: in:imjournal Tainted: G O 4.11.0-rc5+ #3
[ 72.570128] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/05/2016
[ 72.607350] task: ffff88105a664380 task.stack: ffffc9000683c000
[ 72.633968] RIP: 0010:kmem_cache_free+0x5a/0x1f0
[ 72.654852] RSP: 0018:ffffc9000683f9c8 EFLAGS: 00010282
[ 72.678481] RAX: ffffeb04002e0080 RBX: ffffc9000b802000 RCX: 0000000000000002
[ 72.710704] RDX: 000077ff80000000 RSI: ffffc9000b802000 RDI: ffff88017fc07ac0
[ 72.742815] RBP: ffffc9000683f9e0 R08: ffffc9000b802008 R09: ffffffffc0458dc5
[ 72.775750] R10: ffff88046f4de660 R11: ffffea0011a025c0 R12: ffff88017fc07ac0
[ 72.809516] R13: 0000000000000018 R14: ffff880468097200 R15: 0000000000000000
[ 72.843108] FS: 00007f2e10184700(0000) GS:ffff88046f4c0000(0000) knlGS:0000000000000000
[ 72.882118] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 72.910341] CR2: ffffeb04002e00a0 CR3: 0000001054d31000 CR4: 00000000003406e0
[ 72.945776] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 72.980142] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 73.013916] Call Trace:
[ 73.025544]  xfs_trans_free_item_desc+0x45/0x50 [xfs]
[ 73.049367]  xfs_trans_free_items+0x80/0xb0 [xfs]
[ 73.071701]  xfs_log_commit_cil+0x47c/0x5d0 [xfs]
[ 73.093910]  __xfs_trans_commit+0x128/0x230 [xfs]
[ 73.116015]  xfs_trans_commit+0x10/0x20 [xfs]
[ 73.136621]  xfs_create+0x6fa/0x740 [xfs]
[ 73.155495]  xfs_generic_create+0x1ee/0x2d0 [xfs]
[ 73.177635]  ? __d_lookup_done+0x7c/0xe0
[ 73.196182]  xfs_vn_mknod+0x14/0x20 [xfs]
[ 73.214656]  xfs_vn_create+0x13/0x20 [xfs]
[ 73.233910]  path_openat+0xed6/0x13c0
[ 73.251135]  ? futex_wake_op+0x421/0x620
[ 73.269542]  do_filp_open+0x91/0x100
[ 73.286798]  ? do_futex+0x14b/0x570
[ 73.303905]  ? __alloc_fd+0x46/0x170
[ 73.320897]  do_sys_open+0x124/0x210
[ 73.338029]  ? __audit_syscall_exit+0x209/0x290
[ 73.359419]  SyS_open+0x1e/0x20
[ 73.374132]  do_syscall_64+0x67/0x180
[ 73.391704]  entry_SYSCALL64_slow_path+0x25/0x25
[ 73.414681] RIP: 0033:0x7f2e133aaa2d
[ 73.432523] RSP: 002b:00007f2e10183970 EFLAGS: 00000293 ORIG_RAX: 0000000000000002
[ 73.470497] RAX: ffffffffffffffda RBX: 00007f2e12c5394f RCX: 00007f2e133aaa2d
[ 73.504061] RDX: 00000000000001b6 RSI: 0000000000000241 RDI: 00007f2e10183a20
[ 73.537942] RBP: 00007f2e101839d0 R08: 00007f2e12c53954 R09: 0000000000000240
[ 73.571752] R10: 0000000000000024 R11: 0000000000000293 R12: 00007f2e00001f70
[ 73.605982] R13: 0000000000000004 R14: 00007f2e000012f0 R15: 000000000000000a
[ 73.640046] Code: b8 00 00 00 80 4c 8b 4d 08 48 8b 15 b1 d8 9f 00 48 01 d8 0f 83 b7 00 00 00 48 01 d0 48 c1 e8 0c 48 c1 e0 06 48 03 05 9e 4f a3 00 <4c> 8b 58 20 41 f6 c3 01 0f 85 56 01 00 00 49 89 c3 4c 8b 17 65
[ 73.729777] RIP: kmem_cache_free+0x5a/0x1f0 RSP: ffffc9000683f9c8
[ 73.758552] CR2: ffffeb04002e00a0
[ 73.774478] ---[ end trace f5ad68bbafdb5b54 ]---
[ 73.801397] Kernel panic - not syncing: Fatal exception
[ 73.825880] Kernel Offset: disabled
[ 73.849946] ---[ end Kernel panic - not syncing: Fatal exception

> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
>
On Fri, Apr 14, 2017 at 3:12 PM, Linda Knippers <linda.knippers@hpe.com> wrote:
> On 04/14/2017 11:50 AM, Dan Williams wrote:
>>
>> I have not been able to reproduce this panic, but does the following
>> make a difference on your systems?
>>
>> diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
>> index bc02f28ed8b8..afb7a7efc12a 100644
>> --- a/tools/testing/nvdimm/test/nfit.c
>> +++ b/tools/testing/nvdimm/test/nfit.c
>> @@ -1983,9 +1983,9 @@ static __exit void nfit_test_exit(void)
>>  {
>>  	int i;
>>
>> -	platform_driver_unregister(&nfit_test_driver);
>>  	for (i = 0; i < NUM_NFITS; i++)
>>  		platform_device_unregister(&instances[i]->pdev);
>> +	platform_driver_unregister(&nfit_test_driver);
>>  	nfit_test_teardown();
>>  	class_destroy(nfit_test_dimm);
>>  }
>
> I tried it on both of my systems. My 72-core server has NVDIMMs and my
> 4-core laptop does not. Both systems still panic but now
> they panic before trying to rmmod the module, which I had commented
> out of my script. I tried it several times and got panics in different
> modules (xfs, selinux, etc) but all seem to be related to kmem. See
> below for an example.
>
> It feels like corruption that has just moved a bit. If you recall
> from my original mail, I reported two kinds of panics and this looks
> just like the second panic.
>
> Since it never happens for you and always happens for me, it feels
> like a build/procedure problem. I haven't had a chance to try in
> a VM. Have you had a chance to try on bare metal?

I have not, but I did send out a new patch that makes sure to shutdown
the nfit_test trickery before freeing the test objects. Does this
patch cleanup any of the failures you are seeing?

https://patchwork.kernel.org/patch/9681861/
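The patch behind that link is not quoted in this thread, so the following is only a minimal, self-contained sketch of the ordering idea Dan describes ("shutdown the nfit_test trickery before freeing the test objects"), not the actual change; the names here (fake_res, fake_res_list, fake_res_exit) are invented for illustration. The point is simply that the interposer's lookup entries are unhooked before the buffers they reference are freed.

```c
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/vmalloc.h>

/* Illustrative only: one entry per fake "NVDIMM" range the test interposes. */
struct fake_res {
	struct list_head list;
	void *buf;			/* vmalloc'd stand-in for the real range */
};

static LIST_HEAD(fake_res_list);
static DEFINE_SPINLOCK(fake_res_lock);

/* Tear down in the safe order: stop lookups first, free memory second. */
static void fake_res_exit(void)
{
	struct fake_res *res, *next;
	LIST_HEAD(reap);

	/* 1. Unhook every entry so no later lookup can return it. */
	spin_lock(&fake_res_lock);
	list_splice_init(&fake_res_list, &reap);
	spin_unlock(&fake_res_lock);

	/* 2. Only now free the buffers the entries pointed at. */
	list_for_each_entry_safe(res, next, &reap, list) {
		list_del(&res->list);
		vfree(res->buf);
		kfree(res);
	}
}
```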
On 04/14/2017 06:28 PM, Dan Williams wrote:
>
> I have not, but I did send out a new patch that makes sure to shutdown
> the nfit_test trickery before freeing the test objects. Does this
> patch cleanup any of the failures you are seeing?
>
> https://patchwork.kernel.org/patch/9681861/
>

I tried it and got similar results. On my server, once it panicked in
rmmod, once it panicked when disabling the regions. I got excited on my
laptop because it survived the rmmod, but then panicked a few seconds
later. I also got a panic that might have been during the modprobe but I
can't be sure because I don't get good information when my laptop crashes.

I see people posting tests so it's working for some people. Anyone else
interested in sharing what you're running on and exactly how you're
building and installing your kernel and these modules?

-- ljk
On Fri, Apr 14, 2017 at 4:31 PM, Linda Knippers <linda.knippers@hpe.com> wrote:
> On 04/14/2017 06:28 PM, Dan Williams wrote:
>>
>> I have not, but I did send out a new patch that makes sure to shutdown
>> the nfit_test trickery before freeing the test objects. Does this
>> patch cleanup any of the failures you are seeing?
>>
>> https://patchwork.kernel.org/patch/9681861/
>>
>
> I tried it and got similar results. On my server, once it panicked in rmmod, once it
> panicked when disabling the regions. I got excited on my laptop because it survived
> the rmmod, but then panicked a few seconds later. I also got a panic that might
> have been during the modprobe but I can't be sure because I don't get good information
> when my laptop crashes.
>
> I see people posting tests so it's working for some people. Anyone else
> interested in sharing what you're running on and exactly how you're building
> and installing your kernel and these modules?
>

So I reproduced it! ...but only once, and then it went away which is
why I thought that fix might have been good. That at least tells us
that it is something more fundamental and that even my build
environment can hit this sometimes. I think it might be a case where
we get false results from get_nfit_res() and that leads to random
memory corruption.
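To make the suspected failure mode concrete: nfit_test_iomap redirects accesses to the fake NVDIMM ranges through a lookup (get_nfit_res() in the real code). The sketch below is an illustrative stand-in, not the actual implementation; the struct and function names are invented. It shows why a false or stale hit from such a lookup becomes silent memory corruption that only surfaces later, in unrelated slab users like xfs or selinux, which is consistent with the kmem_cache_free oopses in the logs above.

```c
#include <linux/ioport.h>
#include <linux/list.h>
#include <linux/spinlock.h>

/* Illustrative only: one entry per interposed "physical" address range. */
struct fake_res {
	struct list_head list;
	resource_size_t start;
	resource_size_t n;
	void *buf;		/* what ioremap()/memremap() get redirected to */
};

static LIST_HEAD(fake_res_list);
static DEFINE_SPINLOCK(fake_res_lock);

/* Map a "physical" address to its test buffer, or NULL if not interposed. */
static void *fake_res_lookup(resource_size_t addr)
{
	struct fake_res *res;
	void *buf = NULL;

	spin_lock(&fake_res_lock);
	list_for_each_entry(res, &fake_res_list, list) {
		if (addr >= res->start && addr < res->start + res->n) {
			buf = res->buf + (addr - res->start);
			break;
		}
	}
	spin_unlock(&fake_res_lock);

	/*
	 * If res->buf has already been freed, or an entry was left on the
	 * list past teardown, this hands the caller a dangling pointer.
	 * Writes through it corrupt whatever now owns that memory, and the
	 * crash appears much later in some unrelated subsystem.
	 */
	return buf;
}
```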
On 4/14/2017 7:42 PM, Dan Williams wrote:
> On Fri, Apr 14, 2017 at 4:31 PM, Linda Knippers <linda.knippers@hpe.com> wrote:
>> On 04/14/2017 06:28 PM, Dan Williams wrote:
>>>
>>> I have not, but I did send out a new patch that makes sure to shutdown
>>> the nfit_test trickery before freeing the test objects. Does this
>>> patch cleanup any of the failures you are seeing?
>>>
>>> https://patchwork.kernel.org/patch/9681861/
>>>
>>
>> I tried it and got similar results. On my server, once it panicked in rmmod, once it
>> panicked when disabling the regions. I got excited on my laptop because it survived
>> the rmmod, but then panicked a few seconds later. I also got a panic that might
>> have been during the modprobe but I can't be sure because I don't get good information
>> when my laptop crashes.
>>
>> I see people posting tests so it's working for some people. Anyone else
>> interested in sharing what you're running on and exactly how you're building
>> and installing your kernel and these modules?
>>
>
> So I reproduced it! ...but only once, and then it went away which is
> why I thought that fix might have been good. That at least tells us
> that it is something more fundamental and that even my build
> environment can hit this sometimes. I think it might be a case where
> we get false results from get_nfit_res() and that leads to random
> memory corruption.

I'm so happy to hear that you reproduced it, even once. I was perfectly
willing to accept that I was doing something wrong, and still might be
since it is so reproducible for me, but maybe there's just a race that
I keep losing. :-)

-- ljk
diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
index bc02f28ed8b8..afb7a7efc12a 100644
--- a/tools/testing/nvdimm/test/nfit.c
+++ b/tools/testing/nvdimm/test/nfit.c
@@ -1983,9 +1983,9 @@ static __exit void nfit_test_exit(void)
 {
 	int i;

-	platform_driver_unregister(&nfit_test_driver);
 	for (i = 0; i < NUM_NFITS; i++)
 		platform_device_unregister(&instances[i]->pdev);
+	platform_driver_unregister(&nfit_test_driver);
 	nfit_test_teardown();
 	class_destroy(nfit_test_dimm);
 }
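For reference, reconstructed purely from the hunk above (the rest of nfit.c, including NUM_NFITS, instances, nfit_test_teardown() and nfit_test_dimm, is not shown here), nfit_test_exit() reads as follows after the change: the test platform devices are unregistered while the driver is still registered, and only then is the driver itself unregistered.

```c
static __exit void nfit_test_exit(void)
{
	int i;

	/* Unregister the test devices first, so their remove paths run
	 * while the nfit_test driver is still registered... */
	for (i = 0; i < NUM_NFITS; i++)
		platform_device_unregister(&instances[i]->pdev);
	/* ...and only then drop the driver and the remaining test state. */
	platform_driver_unregister(&nfit_test_driver);
	nfit_test_teardown();
	class_destroy(nfit_test_dimm);
}
```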