diff mbox

Kernel oops/panic with NFS over RDMA mount after disrupted Infiniband connection

Message ID 5332D425.3030803@ims.co.at (mailing list archive)
State New, archived
Headers show

Commit Message

rafael.reiter March 26, 2014, 1:20 p.m. UTC
Hello,

I am looking into a problem with NFS/RDMA with openSuse 12.3 and the 3.10.17 kernel. The following kernel oops, followed by a kernel panic, occurs on
the client computer after 5-30 seconds when the connection between NFS server and client is disrupted (e.g. by pulling the Infiniband cable on the
server or using ibportstate to disable the port on the client) and the mounted directory is accessed (cd, ls).


The export on the server side is done with
/data
172.16.100.0/24(rw,wdelay,crossmnt,insecure,no_root_squash,no_subtree_check,fsid=0,mountpoint)

Following command is used for mounting the NFSv4 share:
  mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/


[ 3336.995934] kernel tried to execute NX-protected page - exploit attempt?
(uid: 0)
[ 3337.003415] BUG: unable to handle kernel paging request at ffff880842900768
[ 3337.010404] IP: [<ffff880842900768>] 0xffff880842900767
[ 3337.015658] PGD 1d7c067 PUD 85d4e1063 PMD 842f48063 PTE 8000000842900163
[ 3337.022420] Oops: 0011 [#1] SMP
[ 3337.025681] Modules linked in: xprtrdma(O) auth_rpcgss oid_registry nfsv4
cpuid af_packet 8021q garp stp llc rdma_ucm(O) ib_ucm(O) rdma_cm(O) iw_cm(O)
ib_ipoib(O) ib_cm(O) ib_uverbs(O) ib_umad(O) mlx4_en(O) mlx4_ib(O) ib_sa(O)
ib_mad(O) ib_core(O) ib_addr(O) sr_mod cdrom usb_storage nvidia(PO) joydev
usbhid mlx4_core(O) compat(O) adm1021 lm90 coretemp nouveau kvm_intel kvm
crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul
glue_helper aes_x86_64 acpi_cpufreq iTCO_wdt iTCO_vendor_support microcode
pcspkr ttm drm_kms_helper sb_edac edac_core isci drm i2c_i801 libsas ehci_pci
ehci_hcd scsi_transport_sas mxm_wmi sg video usbcore lpc_ich ioatdma mfd_core
usb_common shpchp pci_hotplug wmi mperf processor thermal_sys button edd fuse
autofs4 xfs libcrc32c nfsv3 nfs fscache lockd nfs_acl sunrpc igb dca
i2c_algo_bit ptp pps_core
[ 3337.102467] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P           O 3.10.17-ims2
#2
[ 3337.109863] Hardware name: Supermicro B9DRG-E/B9DRG-E, BIOS 3.0 09/04/2013
[ 3337.116736] task: ffffffff81a11440 ti: ffffffff81a00000 task.ti:
ffffffff81a00000
[ 3337.124218] RIP: 0010:[<ffff880842900768>]  [<ffff880842900768>]
0xffff880842900767
[ 3337.131892] RSP: 0018:ffff88087fc03e88  EFLAGS: 00010282
[ 3337.137208] RAX: 0000000000000286 RBX: ffff880842900768 RCX: ffff88085e2454a0
[ 3337.144335] RDX: ffff88085e2454a0 RSI: 0000000000000286 RDI: ffff88085e245480
[ 3337.151463] RBP: ffff88087fc03ea0 R08: ffff88085e24b170 R09: 0000000000000040
[ 3337.158588] R10: 0000000000000003 R11: dead000000100100 R12: ffff88085e245480
[ 3337.165716] R13: 0000000000000006 R14: 0000000000000006 R15: ffffffff81a5db90
[ 3337.172842] FS:  0000000000000000(0000) GS:ffff88087fc00000(0000)
knlGS:0000000000000000
[ 3337.180932] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3337.186671] CR2: ffff880842900768 CR3: 0000000001a0c000 CR4: 00000000000407f0
[ 3337.193809] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3337.200934] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3337.208061] Stack:
[ 3337.210073]  ffffffffa04f7cbe ffffffffa04fd388 0000000000000000
ffff88087fc03ec0
[ 3337.217530]  ffffffff81049c82 0000000000000001 ffffffff81a050b0
ffff88087fc03f30
[ 3337.224987]  ffffffff81049870 ffffffff81a01fd8 00000001000b940f
0000000000202000
[ 3337.232443] Call Trace:
[ 3337.234903]  <IRQ>
[ 3337.236838]  [<ffffffffa04f7cbe>] ? rpcrdma_run_tasklet+0x7e/0xc0 [xprtrdma]
[ 3337.244116]  [<ffffffff81049c82>] tasklet_action+0x52/0xc0
[ 3337.249611]  [<ffffffff81049870>] __do_softirq+0xe0/0x220
[ 3337.255014]  [<ffffffff8155cbac>] call_softirq+0x1c/0x30
[ 3337.260335]  [<ffffffff8100452d>] do_softirq+0x4d/0x80
[ 3337.265470]  [<ffffffff81049b05>] irq_exit+0x95/0xa0
[ 3337.270437]  [<ffffffff8100411e>] do_IRQ+0x5e/0xd0
[ 3337.275224]  [<ffffffff81553eaa>] common_interrupt+0x6a/0x6a
[ 3337.280884]  <EOI>
[ 3337.282809]  [<ffffffff81069090>] ? __hrtimer_start_range_ns+0x1c0/0x400
[ 3337.289729]  [<ffffffff8141de86>] ? cpuidle_enter_state+0x56/0xd0
[ 3337.295824]  [<ffffffff8141de82>] ? cpuidle_enter_state+0x52/0xd0
[ 3337.301928]  [<ffffffff8141dfb6>] cpuidle_idle_call+0xb6/0x200
[ 3337.307764]  [<ffffffff8100aa39>] arch_cpu_idle+0x9/0x20
[ 3337.313087]  [<ffffffff81087cc0>] cpu_startup_entry+0x80/0x200
[ 3337.318924]  [<ffffffff815358a2>] rest_init+0x72/0x80
[ 3337.323984]  [<ffffffff81ac4e28>] start_kernel+0x3b2/0x3bf
[ 3337.329464]  [<ffffffff81ac4875>] ? repair_env_string+0x5e/0x5e
[ 3337.335386]  [<ffffffff81ac45a5>] x86_64_start_reservations+0x2a/0x2c
[ 3337.341820]  [<ffffffff81ac4675>] x86_64_start_kernel+0xce/0xd2
[ 3337.347732] Code: 88 ff ff e7 33 4f a0 ff ff ff ff 00 b0 dd 57 10 88 ff ff a9
be a9 be 10 00 00 00 00 01 00 00 00 00 00 00 20 00 00 00 00 00 00 00 <d0> 35 24
5e 08 88 ff ff 98 54 24 5e 08 88 ff ff 02 00 00 00 00
[ 3337.367691] RIP  [<ffff880842900768>] 0xffff880842900767
[ 3337.373024]  RSP <ffff88087fc03e88>
[ 3337.376510] CR2: ffff880842900768
[ 3337.379829] ---[ end trace 3c331859b0aa6701 ]---
[ 3337.385471] Kernel panic - not syncing: Fatal exception in interrupt
[ 3337.392852] ------------[ cut here ]------------
[ 3337.397477] general protection fault: 0000 [#2] SMP
[ 3337.402476] Modules linked in: xprtrdma(O) auth_rpcgss oid_registry nfsv4
cpuid af_packet 8021q garp stp llc rdma_ucm(O) ib_ucm(O) rdma_cm(O) iw_cm(O)
ib_ipoib(O) ib_cm(O) ib_uverbs(O) ib_umad(O) mlx4_en(O) mlx4_ib(O) ib_sa(O)
ib_mad(O) ib_core(O) ib_addr(O) sr_mod cdrom usb_storage nvidia(PO) joydev
usbhid mlx4_core(O) compat(O) adm1021 lm90 coretemp nouveau kvm_intel kvm
crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul
glue_helper aes_x86_64 acpi_cpufreq iTCO_wdt iTCO_vendor_support microcode
pcspkr ttm drm_kms_helper sb_edac edac_core isci drm i2c_i801 libsas ehci_pci
ehci_hcd scsi_transport_sas mxm_wmi sg video usbcore lpc_ich ioatdma mfd_core
usb_common shpchp pci_hotplug wmi mperf processor thermal_sys button edd fuse
autofs4 xfs libcrc32c nfsv3 nfs fscache lockd nfs_acl sunrpc igb dca
i2c_algo_bit ptp pps_core
[ 3337.478881] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P      D    O 3.10.17-ims2
#2
[ 3337.486275] Hardware name: Supermicro B9DRG-E/B9DRG-E, BIOS 3.0 09/04/2013
[ 3337.493142] task: ffffffff81a11440 ti: ffffffff81a00000 task.ti:
ffffffff81a00000
[ 3337.500624] RIP: 0010:[<ffffffff8102a5fb>]  [<ffffffff8102a5fb>]
native_apic_msr_write+0x2b/0x30
[ 3337.509425] RSP: 0018:ffff88087fc036f0  EFLAGS: 00010046
[ 3337.514732] RAX: 00000000000000f6 RBX: 00000000000005a2 RCX: 000000000000083f
[ 3337.521858] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000003f
[ 3337.528987] RBP: ffff88087fc036f0 R08: 0000000000000000 R09: 00000000000005a2
[ 3337.536113] R10: 0000000000000000 R11: 00000000000005a1 R12: 0000000000000046
[ 3337.543240] R13: 0000000000000001 R14: 00000000000005a2 R15: 0000000000000006
[ 3337.550367] FS:  0000000000000000(0000) GS:ffff88087fc00000(0000)
knlGS:0000000000000000
[ 3337.558447] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3337.564186] CR2: ffff880842900768 CR3: 0000000001a0c000 CR4: 00000000000407f0
[ 3337.571315] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3337.578442] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3337.585568] Stack:
[ 3337.587589]  ffff88087fc03700 ffffffff8102a4d8 ffff88087fc03710
ffffffff810064d6
[ 3337.595045]  ffff88087fc03720 ffffffff810ef9d9 ffff88087fc03730
ffffffff8104394f
[ 3337.602502]  ffff88087fc03770 ffffffff81043b65 0000000000000000
0000000000000001
[ 3337.609957] Call Trace:
[ 3337.612403]  <IRQ>
[ 3337.614328]  [<ffffffff8102a4d8>] x2apic_send_IPI_self+0x18/0x20
[ 3337.620552]  [<ffffffff810064d6>] arch_irq_work_raise+0x26/0x40
[ 3337.626475]  [<ffffffff810ef9d9>] irq_work_queue+0xa9/0xd0
[ 3337.631963]  [<ffffffff8104394f>] wake_up_klogd+0x2f/0x40
[ 3337.637356]  [<ffffffff81043b65>] console_unlock+0x205/0x3d0
[ 3337.643010]  [<ffffffff81044115>] vprintk_emit+0x1c5/0x460
[ 3337.648499]  [<ffffffff8107c46a>] ? enqueue_entity+0x37a/0x990
[ 3337.654332]  [<ffffffff81024046>] ? native_smp_send_reschedule+0x56/0x60
[ 3337.661027]  [<ffffffff815498a9>] printk+0x5c/0x5e
[ 3337.665814]  [<ffffffff81024046>] ? native_smp_send_reschedule+0x56/0x60
[ 3337.672516]  [<ffffffff81041d09>] warn_slowpath_common+0x39/0xa0
[ 3337.678524]  [<ffffffff81041d85>] warn_slowpath_null+0x15/0x20
[ 3337.684359]  [<ffffffff81024046>] native_smp_send_reschedule+0x56/0x60
[ 3337.690887]  [<ffffffff8107f084>] trigger_load_balance+0x174/0x210
[ 3337.697069]  [<ffffffff81073715>] scheduler_tick+0x105/0x140
[ 3337.702731]  [<ffffffff81051e62>] update_process_times+0x62/0x80
[ 3337.708738]  [<ffffffff81091219>] tick_sched_handle.isra.10+0x29/0x70
[ 3337.715172]  [<ffffffff810913a7>] tick_sched_timer+0x47/0x70
[ 3337.720827]  [<ffffffff81068bea>] __run_hrtimer+0x6a/0x1b0
[ 3337.726307]  [<ffffffff81342c41>] ? erst_write+0xf1/0x220
[ 3337.731708]  [<ffffffff81091360>] ? tick_sched_do_timer+0x30/0x30
[ 3337.737795]  [<ffffffff810694df>] hrtimer_interrupt+0xff/0x230
[ 3337.743637]  [<ffffffff810250a4>] smp_apic_timer_interrupt+0x64/0xa0
[ 3337.749991]  [<ffffffff8155c53a>] apic_timer_interrupt+0x6a/0x70
[ 3337.755993]  [<ffffffff8154980f>] ? panic+0x181/0x1bf
[ 3337.761049]  [<ffffffff81549779>] ? panic+0xeb/0x1bf
[ 3337.766023]  [<ffffffff81554bca>] oops_end+0xda/0xe0
[ 3337.770991]  [<ffffffff815490f9>] no_context+0x24e/0x279
[ 3337.776307]  [<ffffffff8148f7cb>] ? ip_rcv_finish+0x7b/0x340
[ 3337.781961]  [<ffffffff815492d9>] __bad_area_nosemaphore+0x1b5/0x1d4
[ 3337.788316]  [<ffffffff81549306>] bad_area_nosemaphore+0xe/0x10
[ 3337.794229]  [<ffffffff81557406>] __do_page_fault+0x396/0x4e0
[ 3337.799971]  [<ffffffff8145d6ac>] ? __netif_receive_skb+0x1c/0x70
[ 3337.806067]  [<ffffffffa03c6c71>] ? mlx4_ib_poll_cq+0x91/0xe0 [mlx4_ib]
[ 3337.812681]  [<ffffffffa04f5ff6>] ? rpcrdma_cq_event_upcall+0x246/0x5a0
[xprtrdma]
[ 3337.820249]  [<ffffffff81557559>] do_page_fault+0x9/0x10
[ 3337.825556]  [<ffffffff815540b2>] page_fault+0x22/0x30
[ 3337.830688]  [<ffffffffa04f7cbe>] ? rpcrdma_run_tasklet+0x7e/0xc0 [xprtrdma]
[ 3337.837728]  [<ffffffff81049c82>] tasklet_action+0x52/0xc0
[ 3337.843208]  [<ffffffff81049870>] __do_softirq+0xe0/0x220
[ 3337.848603]  [<ffffffff8155cbac>] call_softirq+0x1c/0x30
[ 3337.853916]  [<ffffffff8100452d>] do_softirq+0x4d/0x80
[ 3337.859048]  [<ffffffff81049b05>] irq_exit+0x95/0xa0
[ 3337.864007]  [<ffffffff8100411e>] do_IRQ+0x5e/0xd0
[ 3337.868795]  [<ffffffff81553eaa>] common_interrupt+0x6a/0x6a
[ 3337.874454]  <EOI>
[ 3337.876380]  [<ffffffff81069090>] ? __hrtimer_start_range_ns+0x1c0/0x400
[ 3337.883290]  [<ffffffff8141de86>] ? cpuidle_enter_state+0x56/0xd0
[ 3337.889377]  [<ffffffff8141de82>] ? cpuidle_enter_state+0x52/0xd0
[ 3337.895464]  [<ffffffff8141dfb6>] cpuidle_idle_call+0xb6/0x200
[ 3337.901300]  [<ffffffff8100aa39>] arch_cpu_idle+0x9/0x20
[ 3337.906613]  [<ffffffff81087cc0>] cpu_startup_entry+0x80/0x200
[ 3337.912442]  [<ffffffff815358a2>] rest_init+0x72/0x80
[ 3337.917495]  [<ffffffff81ac4e28>] start_kernel+0x3b2/0x3bf
[ 3337.922974]  [<ffffffff81ac4875>] ? repair_env_string+0x5e/0x5e
[ 3337.928888]  [<ffffffff81ac45a5>] x86_64_start_reservations+0x2a/0x2c
[ 3337.935323]  [<ffffffff81ac4675>] x86_64_start_kernel+0xce/0xd2
[ 3337.941243] Code: 55 83 ff 20 48 89 e5 74 24 81 ff e0 00 00 00 74 1c 83 ff 30
74 17 81 ff d0 00 00 00 74 0f c1 ef 04 31 d2 89 f0 8d 8f 00 08 00 00 <0f> 30 5d
c3 90 55 81 ff e0 00 00 00 b8 ff ff ff ff 48 89 e5 74
[ 3337.961200] RIP  [<ffffffff8102a5fb>] native_apic_msr_write+0x2b/0x30
[ 3337.967653]  RSP <ffff88087fc036f0>
[ 3337.971139] ---[ end trace 3c331859b0aa6702 ]---

As deductable from rpcrdma_run_tasklet+0x7e, the problem lies in net/sunrpc/xprtrdma/verbs.c:95:

  77 static void
  78 rpcrdma_run_tasklet(unsigned long data)
  79 {
  80         struct rpcrdma_rep *rep;
  81         void (*func)(struct rpcrdma_rep *);
  82         unsigned long flags;
  83
  84         data = data;
  85         spin_lock_irqsave(&rpcrdma_tk_lock_g, flags);
  86         while (!list_empty(&rpcrdma_tasklets_g)) {
  87                 rep = list_entry(rpcrdma_tasklets_g.next,
  88                                  struct rpcrdma_rep, rr_list);
  89                 list_del(&rep->rr_list);
  90                 func = rep->rr_func;
  91                 rep->rr_func = NULL;
  92                 spin_unlock_irqrestore(&rpcrdma_tk_lock_g, flags);
  93
  94                 if (func)
  95                         func(rep);
  96                 else
  97                         rpcrdma_recv_buffer_put(rep);
  98
  99                 spin_lock_irqsave(&rpcrdma_tk_lock_g, flags);
 100         }
 101         spin_unlock_irqrestore(&rpcrdma_tk_lock_g, flags);
 102 }

func is a function pointer which, as far as I have seen, has the same value throughout normal operation, but as soon as the connection is lost
(if-condition in 158 is true), rpcrdma_event_process schedules a tasklet which leads to a faulty pointer func with the value LIST_POISON1
(0xdead000000100100):

 158         if (IB_WC_SUCCESS != wc->status) {
 159                 dprintk("RPC:       %s: WC opcode %d status %X, connection lost\n",
 160                         __func__, wc->opcode, wc->status);
 161                 rep->rr_len = ~0U;
 162                 if (wc->opcode != IB_WC_FAST_REG_MR && wc->opcode != IB_WC_LOCAL_INV)
 163                         rpcrdma_schedule_tasklet(rep);
 164                 return;
 165         }

In normal operation, rpcrdma_schedule_tasklet is called if wc->opcode is IB_WC_RECV or IB_WC_BIND_MW (verbs.c, line 167-208). I adjusted the
if-condition in 162 accordingly:



This seems to fix the kernel oops/panic. I guess this is not the most elegant way, and I kindly ask you to improve on this.

As similar error is referenced briefly by Yan Burman in this posting from 02/18/2013 12:44 PM.

Please also consider my other post (which I will post later today) about a failure to reinstate the mount after a lost connection.

Best,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Chuck Lever III March 26, 2014, 6:15 p.m. UTC | #1
On Mar 26, 2014, at 6:20 AM, rafael.reiter <rafael.reiter@ims.co.at> wrote:

> Hello,
> 
> I am looking into a problem with NFS/RDMA with openSuse 12.3 and the 3.10.17 kernel. The following kernel oops, followed by a kernel panic, occurs on
> the client computer after 5-30 seconds when the connection between NFS server and client is disrupted (e.g. by pulling the Infiniband cable on the
> server or using ibportstate to disable the port on the client) and the mounted directory is accessed (cd, ls).
> 
> 
> The export on the server side is done with
> /data
> 172.16.100.0/24(rw,wdelay,crossmnt,insecure,no_root_squash,no_subtree_check,fsid=0,mountpoint)
> 
> Following command is used for mounting the NFSv4 share:
>  mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/
> 
> 
> [ 3336.995934] kernel tried to execute NX-protected page - exploit attempt?
> (uid: 0)
> [ 3337.003415] BUG: unable to handle kernel paging request at ffff880842900768
> [ 3337.010404] IP: [<ffff880842900768>] 0xffff880842900767
> [ 3337.015658] PGD 1d7c067 PUD 85d4e1063 PMD 842f48063 PTE 8000000842900163
> [ 3337.022420] Oops: 0011 [#1] SMP
> [ 3337.025681] Modules linked in: xprtrdma(O) auth_rpcgss oid_registry nfsv4
> cpuid af_packet 8021q garp stp llc rdma_ucm(O) ib_ucm(O) rdma_cm(O) iw_cm(O)
> ib_ipoib(O) ib_cm(O) ib_uverbs(O) ib_umad(O) mlx4_en(O) mlx4_ib(O) ib_sa(O)
> ib_mad(O) ib_core(O) ib_addr(O) sr_mod cdrom usb_storage nvidia(PO) joydev
> usbhid mlx4_core(O) compat(O) adm1021 lm90 coretemp nouveau kvm_intel kvm
> crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul
> glue_helper aes_x86_64 acpi_cpufreq iTCO_wdt iTCO_vendor_support microcode
> pcspkr ttm drm_kms_helper sb_edac edac_core isci drm i2c_i801 libsas ehci_pci
> ehci_hcd scsi_transport_sas mxm_wmi sg video usbcore lpc_ich ioatdma mfd_core
> usb_common shpchp pci_hotplug wmi mperf processor thermal_sys button edd fuse
> autofs4 xfs libcrc32c nfsv3 nfs fscache lockd nfs_acl sunrpc igb dca
> i2c_algo_bit ptp pps_core
> [ 3337.102467] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P           O 3.10.17-ims2
> #2
> [ 3337.109863] Hardware name: Supermicro B9DRG-E/B9DRG-E, BIOS 3.0 09/04/2013
> [ 3337.116736] task: ffffffff81a11440 ti: ffffffff81a00000 task.ti:
> ffffffff81a00000
> [ 3337.124218] RIP: 0010:[<ffff880842900768>]  [<ffff880842900768>]
> 0xffff880842900767
> [ 3337.131892] RSP: 0018:ffff88087fc03e88  EFLAGS: 00010282
> [ 3337.137208] RAX: 0000000000000286 RBX: ffff880842900768 RCX: ffff88085e2454a0
> [ 3337.144335] RDX: ffff88085e2454a0 RSI: 0000000000000286 RDI: ffff88085e245480
> [ 3337.151463] RBP: ffff88087fc03ea0 R08: ffff88085e24b170 R09: 0000000000000040
> [ 3337.158588] R10: 0000000000000003 R11: dead000000100100 R12: ffff88085e245480
> [ 3337.165716] R13: 0000000000000006 R14: 0000000000000006 R15: ffffffff81a5db90
> [ 3337.172842] FS:  0000000000000000(0000) GS:ffff88087fc00000(0000)
> knlGS:0000000000000000
> [ 3337.180932] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 3337.186671] CR2: ffff880842900768 CR3: 0000000001a0c000 CR4: 00000000000407f0
> [ 3337.193809] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 3337.200934] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 3337.208061] Stack:
> [ 3337.210073]  ffffffffa04f7cbe ffffffffa04fd388 0000000000000000
> ffff88087fc03ec0
> [ 3337.217530]  ffffffff81049c82 0000000000000001 ffffffff81a050b0
> ffff88087fc03f30
> [ 3337.224987]  ffffffff81049870 ffffffff81a01fd8 00000001000b940f
> 0000000000202000
> [ 3337.232443] Call Trace:
> [ 3337.234903]  <IRQ>
> [ 3337.236838]  [<ffffffffa04f7cbe>] ? rpcrdma_run_tasklet+0x7e/0xc0 [xprtrdma]
> [ 3337.244116]  [<ffffffff81049c82>] tasklet_action+0x52/0xc0
> [ 3337.249611]  [<ffffffff81049870>] __do_softirq+0xe0/0x220
> [ 3337.255014]  [<ffffffff8155cbac>] call_softirq+0x1c/0x30
> [ 3337.260335]  [<ffffffff8100452d>] do_softirq+0x4d/0x80
> [ 3337.265470]  [<ffffffff81049b05>] irq_exit+0x95/0xa0
> [ 3337.270437]  [<ffffffff8100411e>] do_IRQ+0x5e/0xd0
> [ 3337.275224]  [<ffffffff81553eaa>] common_interrupt+0x6a/0x6a
> [ 3337.280884]  <EOI>
> [ 3337.282809]  [<ffffffff81069090>] ? __hrtimer_start_range_ns+0x1c0/0x400
> [ 3337.289729]  [<ffffffff8141de86>] ? cpuidle_enter_state+0x56/0xd0
> [ 3337.295824]  [<ffffffff8141de82>] ? cpuidle_enter_state+0x52/0xd0
> [ 3337.301928]  [<ffffffff8141dfb6>] cpuidle_idle_call+0xb6/0x200
> [ 3337.307764]  [<ffffffff8100aa39>] arch_cpu_idle+0x9/0x20
> [ 3337.313087]  [<ffffffff81087cc0>] cpu_startup_entry+0x80/0x200
> [ 3337.318924]  [<ffffffff815358a2>] rest_init+0x72/0x80
> [ 3337.323984]  [<ffffffff81ac4e28>] start_kernel+0x3b2/0x3bf
> [ 3337.329464]  [<ffffffff81ac4875>] ? repair_env_string+0x5e/0x5e
> [ 3337.335386]  [<ffffffff81ac45a5>] x86_64_start_reservations+0x2a/0x2c
> [ 3337.341820]  [<ffffffff81ac4675>] x86_64_start_kernel+0xce/0xd2
> [ 3337.347732] Code: 88 ff ff e7 33 4f a0 ff ff ff ff 00 b0 dd 57 10 88 ff ff a9
> be a9 be 10 00 00 00 00 01 00 00 00 00 00 00 20 00 00 00 00 00 00 00 <d0> 35 24
> 5e 08 88 ff ff 98 54 24 5e 08 88 ff ff 02 00 00 00 00
> [ 3337.367691] RIP  [<ffff880842900768>] 0xffff880842900767
> [ 3337.373024]  RSP <ffff88087fc03e88>
> [ 3337.376510] CR2: ffff880842900768
> [ 3337.379829] ---[ end trace 3c331859b0aa6701 ]---
> [ 3337.385471] Kernel panic - not syncing: Fatal exception in interrupt
> [ 3337.392852] ------------[ cut here ]------------
> [ 3337.397477] general protection fault: 0000 [#2] SMP
> [ 3337.402476] Modules linked in: xprtrdma(O) auth_rpcgss oid_registry nfsv4
> cpuid af_packet 8021q garp stp llc rdma_ucm(O) ib_ucm(O) rdma_cm(O) iw_cm(O)
> ib_ipoib(O) ib_cm(O) ib_uverbs(O) ib_umad(O) mlx4_en(O) mlx4_ib(O) ib_sa(O)
> ib_mad(O) ib_core(O) ib_addr(O) sr_mod cdrom usb_storage nvidia(PO) joydev
> usbhid mlx4_core(O) compat(O) adm1021 lm90 coretemp nouveau kvm_intel kvm
> crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul
> glue_helper aes_x86_64 acpi_cpufreq iTCO_wdt iTCO_vendor_support microcode
> pcspkr ttm drm_kms_helper sb_edac edac_core isci drm i2c_i801 libsas ehci_pci
> ehci_hcd scsi_transport_sas mxm_wmi sg video usbcore lpc_ich ioatdma mfd_core
> usb_common shpchp pci_hotplug wmi mperf processor thermal_sys button edd fuse
> autofs4 xfs libcrc32c nfsv3 nfs fscache lockd nfs_acl sunrpc igb dca
> i2c_algo_bit ptp pps_core
> [ 3337.478881] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P      D    O 3.10.17-ims2
> #2
> [ 3337.486275] Hardware name: Supermicro B9DRG-E/B9DRG-E, BIOS 3.0 09/04/2013
> [ 3337.493142] task: ffffffff81a11440 ti: ffffffff81a00000 task.ti:
> ffffffff81a00000
> [ 3337.500624] RIP: 0010:[<ffffffff8102a5fb>]  [<ffffffff8102a5fb>]
> native_apic_msr_write+0x2b/0x30
> [ 3337.509425] RSP: 0018:ffff88087fc036f0  EFLAGS: 00010046
> [ 3337.514732] RAX: 00000000000000f6 RBX: 00000000000005a2 RCX: 000000000000083f
> [ 3337.521858] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000003f
> [ 3337.528987] RBP: ffff88087fc036f0 R08: 0000000000000000 R09: 00000000000005a2
> [ 3337.536113] R10: 0000000000000000 R11: 00000000000005a1 R12: 0000000000000046
> [ 3337.543240] R13: 0000000000000001 R14: 00000000000005a2 R15: 0000000000000006
> [ 3337.550367] FS:  0000000000000000(0000) GS:ffff88087fc00000(0000)
> knlGS:0000000000000000
> [ 3337.558447] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 3337.564186] CR2: ffff880842900768 CR3: 0000000001a0c000 CR4: 00000000000407f0
> [ 3337.571315] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 3337.578442] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 3337.585568] Stack:
> [ 3337.587589]  ffff88087fc03700 ffffffff8102a4d8 ffff88087fc03710
> ffffffff810064d6
> [ 3337.595045]  ffff88087fc03720 ffffffff810ef9d9 ffff88087fc03730
> ffffffff8104394f
> [ 3337.602502]  ffff88087fc03770 ffffffff81043b65 0000000000000000
> 0000000000000001
> [ 3337.609957] Call Trace:
> [ 3337.612403]  <IRQ>
> [ 3337.614328]  [<ffffffff8102a4d8>] x2apic_send_IPI_self+0x18/0x20
> [ 3337.620552]  [<ffffffff810064d6>] arch_irq_work_raise+0x26/0x40
> [ 3337.626475]  [<ffffffff810ef9d9>] irq_work_queue+0xa9/0xd0
> [ 3337.631963]  [<ffffffff8104394f>] wake_up_klogd+0x2f/0x40
> [ 3337.637356]  [<ffffffff81043b65>] console_unlock+0x205/0x3d0
> [ 3337.643010]  [<ffffffff81044115>] vprintk_emit+0x1c5/0x460
> [ 3337.648499]  [<ffffffff8107c46a>] ? enqueue_entity+0x37a/0x990
> [ 3337.654332]  [<ffffffff81024046>] ? native_smp_send_reschedule+0x56/0x60
> [ 3337.661027]  [<ffffffff815498a9>] printk+0x5c/0x5e
> [ 3337.665814]  [<ffffffff81024046>] ? native_smp_send_reschedule+0x56/0x60
> [ 3337.672516]  [<ffffffff81041d09>] warn_slowpath_common+0x39/0xa0
> [ 3337.678524]  [<ffffffff81041d85>] warn_slowpath_null+0x15/0x20
> [ 3337.684359]  [<ffffffff81024046>] native_smp_send_reschedule+0x56/0x60
> [ 3337.690887]  [<ffffffff8107f084>] trigger_load_balance+0x174/0x210
> [ 3337.697069]  [<ffffffff81073715>] scheduler_tick+0x105/0x140
> [ 3337.702731]  [<ffffffff81051e62>] update_process_times+0x62/0x80
> [ 3337.708738]  [<ffffffff81091219>] tick_sched_handle.isra.10+0x29/0x70
> [ 3337.715172]  [<ffffffff810913a7>] tick_sched_timer+0x47/0x70
> [ 3337.720827]  [<ffffffff81068bea>] __run_hrtimer+0x6a/0x1b0
> [ 3337.726307]  [<ffffffff81342c41>] ? erst_write+0xf1/0x220
> [ 3337.731708]  [<ffffffff81091360>] ? tick_sched_do_timer+0x30/0x30
> [ 3337.737795]  [<ffffffff810694df>] hrtimer_interrupt+0xff/0x230
> [ 3337.743637]  [<ffffffff810250a4>] smp_apic_timer_interrupt+0x64/0xa0
> [ 3337.749991]  [<ffffffff8155c53a>] apic_timer_interrupt+0x6a/0x70
> [ 3337.755993]  [<ffffffff8154980f>] ? panic+0x181/0x1bf
> [ 3337.761049]  [<ffffffff81549779>] ? panic+0xeb/0x1bf
> [ 3337.766023]  [<ffffffff81554bca>] oops_end+0xda/0xe0
> [ 3337.770991]  [<ffffffff815490f9>] no_context+0x24e/0x279
> [ 3337.776307]  [<ffffffff8148f7cb>] ? ip_rcv_finish+0x7b/0x340
> [ 3337.781961]  [<ffffffff815492d9>] __bad_area_nosemaphore+0x1b5/0x1d4
> [ 3337.788316]  [<ffffffff81549306>] bad_area_nosemaphore+0xe/0x10
> [ 3337.794229]  [<ffffffff81557406>] __do_page_fault+0x396/0x4e0
> [ 3337.799971]  [<ffffffff8145d6ac>] ? __netif_receive_skb+0x1c/0x70
> [ 3337.806067]  [<ffffffffa03c6c71>] ? mlx4_ib_poll_cq+0x91/0xe0 [mlx4_ib]
> [ 3337.812681]  [<ffffffffa04f5ff6>] ? rpcrdma_cq_event_upcall+0x246/0x5a0
> [xprtrdma]
> [ 3337.820249]  [<ffffffff81557559>] do_page_fault+0x9/0x10
> [ 3337.825556]  [<ffffffff815540b2>] page_fault+0x22/0x30
> [ 3337.830688]  [<ffffffffa04f7cbe>] ? rpcrdma_run_tasklet+0x7e/0xc0 [xprtrdma]
> [ 3337.837728]  [<ffffffff81049c82>] tasklet_action+0x52/0xc0
> [ 3337.843208]  [<ffffffff81049870>] __do_softirq+0xe0/0x220
> [ 3337.848603]  [<ffffffff8155cbac>] call_softirq+0x1c/0x30
> [ 3337.853916]  [<ffffffff8100452d>] do_softirq+0x4d/0x80
> [ 3337.859048]  [<ffffffff81049b05>] irq_exit+0x95/0xa0
> [ 3337.864007]  [<ffffffff8100411e>] do_IRQ+0x5e/0xd0
> [ 3337.868795]  [<ffffffff81553eaa>] common_interrupt+0x6a/0x6a
> [ 3337.874454]  <EOI>
> [ 3337.876380]  [<ffffffff81069090>] ? __hrtimer_start_range_ns+0x1c0/0x400
> [ 3337.883290]  [<ffffffff8141de86>] ? cpuidle_enter_state+0x56/0xd0
> [ 3337.889377]  [<ffffffff8141de82>] ? cpuidle_enter_state+0x52/0xd0
> [ 3337.895464]  [<ffffffff8141dfb6>] cpuidle_idle_call+0xb6/0x200
> [ 3337.901300]  [<ffffffff8100aa39>] arch_cpu_idle+0x9/0x20
> [ 3337.906613]  [<ffffffff81087cc0>] cpu_startup_entry+0x80/0x200
> [ 3337.912442]  [<ffffffff815358a2>] rest_init+0x72/0x80
> [ 3337.917495]  [<ffffffff81ac4e28>] start_kernel+0x3b2/0x3bf
> [ 3337.922974]  [<ffffffff81ac4875>] ? repair_env_string+0x5e/0x5e
> [ 3337.928888]  [<ffffffff81ac45a5>] x86_64_start_reservations+0x2a/0x2c
> [ 3337.935323]  [<ffffffff81ac4675>] x86_64_start_kernel+0xce/0xd2
> [ 3337.941243] Code: 55 83 ff 20 48 89 e5 74 24 81 ff e0 00 00 00 74 1c 83 ff 30
> 74 17 81 ff d0 00 00 00 74 0f c1 ef 04 31 d2 89 f0 8d 8f 00 08 00 00 <0f> 30 5d
> c3 90 55 81 ff e0 00 00 00 b8 ff ff ff ff 48 89 e5 74
> [ 3337.961200] RIP  [<ffffffff8102a5fb>] native_apic_msr_write+0x2b/0x30
> [ 3337.967653]  RSP <ffff88087fc036f0>
> [ 3337.971139] ---[ end trace 3c331859b0aa6702 ]---
> 
> As deductable from rpcrdma_run_tasklet+0x7e, the problem lies in net/sunrpc/xprtrdma/verbs.c:95:
> 
>  77 static void
>  78 rpcrdma_run_tasklet(unsigned long data)
>  79 {
>  80         struct rpcrdma_rep *rep;
>  81         void (*func)(struct rpcrdma_rep *);
>  82         unsigned long flags;
>  83
>  84         data = data;
>  85         spin_lock_irqsave(&rpcrdma_tk_lock_g, flags);
>  86         while (!list_empty(&rpcrdma_tasklets_g)) {
>  87                 rep = list_entry(rpcrdma_tasklets_g.next,
>  88                                  struct rpcrdma_rep, rr_list);
>  89                 list_del(&rep->rr_list);
>  90                 func = rep->rr_func;
>  91                 rep->rr_func = NULL;
>  92                 spin_unlock_irqrestore(&rpcrdma_tk_lock_g, flags);
>  93
>  94                 if (func)
>  95                         func(rep);
>  96                 else
>  97                         rpcrdma_recv_buffer_put(rep);
>  98
>  99                 spin_lock_irqsave(&rpcrdma_tk_lock_g, flags);
> 100         }
> 101         spin_unlock_irqrestore(&rpcrdma_tk_lock_g, flags);
> 102 }
> 
> func is a function pointer which, as far as I have seen, has the same value throughout normal operation, but as soon as the connection is lost
> (if-condition in 158 is true), rpcrdma_event_process schedules a tasklet which leads to a faulty pointer func with the value LIST_POISON1
> (0xdead000000100100):
> 
> 158         if (IB_WC_SUCCESS != wc->status) {
> 159                 dprintk("RPC:       %s: WC opcode %d status %X, connection lost\n",
> 160                         __func__, wc->opcode, wc->status);
> 161                 rep->rr_len = ~0U;
> 162                 if (wc->opcode != IB_WC_FAST_REG_MR && wc->opcode != IB_WC_LOCAL_INV)
> 163                         rpcrdma_schedule_tasklet(rep);
> 164                 return;
> 165         }
> 
> In normal operation, rpcrdma_schedule_tasklet is called if wc->opcode is IB_WC_RECV or IB_WC_BIND_MW (verbs.c, line 167-208). I adjusted the
> if-condition in 162 accordingly:
> 
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 93726560..9bf8882 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -159,7 +159,7 @@ void rpcrdma_event_process(struct ib_wc *wc)
>                dprintk("RPC:       %s: WC opcode %d status %X, connection lost\n",
>                        __func__, wc->opcode, wc->status);
>                rep->rr_len = ~0U;
> -               if (wc->opcode != IB_WC_FAST_REG_MR && wc->opcode != IB_WC_LOCAL_INV)
> +               if (wc->opcode == IB_WC_RECV || wc->opcode == IB_WC_BIND_MW)
>                        rpcrdma_schedule_tasklet(rep);
>                return;
>        }
> 
> 
> This seems to fix the kernel oops/panic. I guess this is not the most elegant way, and I kindly ask you to improve on this.

Hi Rafael-

I’ll take a look. Can you report your HCA and how you reproduce this issue?

Do you (or does anyone) know whether OpenSuSE uses the stock upstream providers, or has any of OFED been applied to this kernel?


> As similar error is referenced briefly by Yan Burman in this posting from 02/18/2013 12:44 PM.
> 
> Please also consider my other post (which I will post later today) about a failure to reinstate the mount after a lost connection.


--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
rafael.reiter March 27, 2014, 7:53 a.m. UTC | #2
On 03/26/2014 07:15 PM, Chuck Lever wrote:
> 
> Hi Rafael-
> 
> I’ll take a look. Can you report your HCA and how you reproduce this issue?

The HCA is Mellanox Technologies MT26428.

Reproduction:
1) Mount a directory via NFS/RDMA
mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/
2) Pull the Infiniband cable or use ibportstate to disrupt the Infiniband connection
3) ls /mnt
4) wait 5-30 seconds

> 
> Do you (or does anyone) know whether OpenSuSE uses the stock upstream providers, or has any of OFED been applied to this kernel?

My first post was misleading -- I use the vanilla 3.10.17 kernel, and the error occurs with and without Mellanox OFED. The above report
references the vanilla kernel.

Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chuck Lever III March 27, 2014, 3:59 p.m. UTC | #3
Hi-


On Mar 27, 2014, at 12:53 AM, Reiter Rafael <rafael.reiter@ims.co.at> wrote:

> On 03/26/2014 07:15 PM, Chuck Lever wrote:
>> 
>> Hi Rafael-
>> 
>> I’ll take a look. Can you report your HCA and how you reproduce this issue?
> 
> The HCA is Mellanox Technologies MT26428.
> 
> Reproduction:
> 1) Mount a directory via NFS/RDMA
> mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/
> 2) Pull the Infiniband cable or use ibportstate to disrupt the Infiniband connection
> 3) ls /mnt
> 4) wait 5-30 seconds

Thanks for the information.

I have that HCA, but I won’t have access to my test systems for a week (traveling). So can you try this:

 # rpcdebug -m rpc -s trans

then reproduce (starting with step 1 above). Some debugging output will appear at the tail of /var/log/messages. Copy it to this thread.

> 
>> 
>> Do you (or does anyone) know whether OpenSuSE uses the stock upstream providers, or has any of OFED been applied to this kernel?
> 
> My first post was misleading -- I use the vanilla 3.10.17 kernel, and the error occurs with and without Mellanox OFED. The above report
> references the vanilla kernel.
> 
> Rafael
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Klemens Senn March 28, 2014, 9:42 a.m. UTC | #4
Hi Chuck,

On 03/27/2014 04:59 PM, Chuck Lever wrote:
> Hi-
> 
> 
> On Mar 27, 2014, at 12:53 AM, Reiter Rafael <rafael.reiter@ims.co.at> wrote:
> 
>> On 03/26/2014 07:15 PM, Chuck Lever wrote:
>>>
>>> Hi Rafael-
>>>
>>> I’ll take a look. Can you report your HCA and how you reproduce this issue?
>>
>> The HCA is Mellanox Technologies MT26428.
>>
>> Reproduction:
>> 1) Mount a directory via NFS/RDMA
>> mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/

An additional "ls /mnt" is needed here (between step 1 and 2)

>> 2) Pull the Infiniband cable or use ibportstate to disrupt the Infiniband connection
>> 3) ls /mnt
>> 4) wait 5-30 seconds
> 
> Thanks for the information.
> 
> I have that HCA, but I won’t have access to my test systems for a week (traveling). So can you try this:
> 
>  # rpcdebug -m rpc -s trans
> 
> then reproduce (starting with step 1 above). Some debugging output will appear at the tail of /var/log/messages. Copy it to this thread.
> 

The output of /var/log/messages is:

[  143.233701] RPC:  1688 xprt_rdma_allocate: size 1112 too large for
buffer[1024]: prog 100003 vers 4 proc 1
[  143.233708] RPC:  1688 xprt_rdma_allocate: size 1112, request
0xffff88105894c000
[  143.233715] RPC:  1688 rpcrdma_inline_pullup: pad 0 destp
0xffff88105894d7dc len 124 hdrlen 124
[  143.233718] RPC:       rpcrdma_register_frmr_external: Using frmr
ffff88084e589260 to map 1 segments
[  143.233722] RPC:  1688 rpcrdma_create_chunks: reply chunk elem
652@0x105894d92c:0xced01 (last)
[  143.233725] RPC:  1688 rpcrdma_marshal_req: reply chunk: hdrlen 48
rpclen 124 padlen 0 headerp 0xffff88105894d100 base 0xffff88105894d760
lkey 0x8000
[  143.233785] RPC:       rpcrdma_event_process: event rep
ffff88084e589260 status 0 opcode 8 length 0
[  177.272397] RPC:       rpcrdma_event_process: event rep
(null) status C opcode FFFF8808 length 4294967295
[  177.272649] RPC:       rpcrdma_event_process: event rep
ffff880848ed0000 status 5 opcode FFFF8808 length 4294936584
[  177.272651] RPC:       rpcrdma_event_process: WC opcode -30712 status
5, connection lost
[  177.984996] RPC:  1689 xprt_rdma_allocate: size 436, request
0xffff880848f00000
[  182.290655] RPC:       xprt_rdma_connect_worker: reconnect
[  182.290992] RPC:       rpcrdma_ep_disconnect: after wait, disconnected
[  187.300726] RPC:       xprt_rdma_connect_worker: exit
[  197.320527] RPC:       xprt_rdma_connect_worker: reconnect
[  197.320795] RPC:       rpcrdma_ep_disconnect: after wait, disconnected
[  202.330477] RPC:       xprt_rdma_connect_worker: exit
[  222.354286] RPC:       xprt_rdma_connect_worker: reconnect
[  222.354624] RPC:       rpcrdma_ep_disconnect: after wait, disconnected


The output on the serial terminal is:

[  227.364376] kernel tried to execute NX-protected page - exploit
attempt? (uid: 0)
[  227.364517] RPC:  1689 rpcrdma_inline_pullup: pad 0 destp
0xffff880848f017c4 len 100 hdrlen 100
[  227.364519] RPC:       rpcrdma_register_frmr_external: Using frmr
ffff88084e588810 to map 1 segments
[  227.364522] RPC:  1689 rpcrdma_create_chunks: reply chunk elem
152@0x848f0187c:0xcab01 (last)
[  227.364523] RPC:  1689 rpcrdma_marshal_req: reply chunk: hdrlen 48
rpclen 100 padlen 0 headerp 0xffff880848f01100 base 0xffff880848f01760
lkey 0x8000
[  227.411547] BUG: unable to handle kernel paging request at
ffff880848ed1758
[  227.418535] IP: [<ffff880848ed1758>] 0xffff880848ed1757
[  227.423781] PGD 1d7c067 PUD 85d52f063 PMD 848f12063 PTE 8000000848ed1163
[  227.430544] Oops: 0011 [#1] SMP
[  227.433802] Modules linked in: auth_rpcgss oid_registry nfsv4
xprtrdma cpuid af_packet 8021q garp stp llc rdma_ucm ib_ucm rdma_cm
iw_cm ib_addr ib_ipoib ib_cm ib_uverbs ib_umad mlx4_en mlx4_ib ib_sa
ib_mad ib_core joydev usbhid mlx4_core iTCO_wdt iTCO_vendor_support
acpi_cpufreq coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel
aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64
microcode pcspkr sb_edac edac_core i2c_i801 isci sg libsas ehci_pci
ehci_hcd scsi_transport_sas usbcore lpc_ich ioatdma mfd_core usb_common
shpchp pci_hotplug wmi mperf processor thermal_sys button edd autofs4
xfs libcrc32c nfsv3 nfs fscache lockd nfs_acl sunrpc igb dca
i2c_algo_bit ptp pps_core
[  227.496536] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
3.10.17-allpatches+ #1
[  227.503583] Hardware name: Supermicro B9DRG-E/B9DRG-E, BIOS 3.0
09/04/2013
[  227.510451] task: ffffffff81a11440 ti: ffffffff81a00000 task.ti:
ffffffff81a00000
[  227.517924] RIP: 0010:[<ffff880848ed1758>]  [<ffff880848ed1758>]
0xffff880848ed1757
[  227.525597] RSP: 0018:ffff88087fc03e88  EFLAGS: 00010282
[  227.530903] RAX: 0000000000000286 RBX: ffff880848ed1758 RCX:
ffffffffa0354360
[  227.538032] RDX: ffff88084e589280 RSI: 0000000000000286 RDI:
ffff88084e589260
[  227.545157] RBP: ffff88087fc03ea0 R08: ffffffffa0354360 R09:
00000000000005f0
[  227.552286] R10: 0000000000000003 R11: dead000000100100 R12:
ffff88084e589260
[  227.559412] R13: 0000000000000006 R14: 0000000000000006 R15:
ffffffff81a5db90
[  227.566540] FS:  0000000000000000(0000) GS:ffff88087fc00000(0000)
knlGS:0000000000000000
[  227.574620] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  227.580360] CR2: ffff880848ed1758 CR3: 0000000001a0c000 CR4:
00000000000407f0
[  227.587487] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[  227.594614] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[  227.601742] Stack:
[  227.603754]  ffffffffa034edde ffffffffa0354388 0000000000000000
ffff88087fc03ec0
[  227.611210]  ffffffff81049c82 0000000000000001 ffffffff81a050b0
ffff88087fc03f30
[  227.618666]  ffffffff81049870 ffffffff81a01fd8 00000000ffffb895
0000000000202000
[  227.626123] Call Trace:
[  227.628569]  <IRQ>
[  227.630493]  [<ffffffffa034edde>] ? rpcrdma_run_tasklet+0x7e/0xc0
[xprtrdma]
[  227.637764]  [<ffffffff81049c82>] tasklet_action+0x52/0xc0
[  227.643249]  [<ffffffff81049870>] __do_softirq+0xe0/0x220
[  227.648643]  [<ffffffff8155cbac>] call_softirq+0x1c/0x30
[  227.653957]  [<ffffffff8100452d>] do_softirq+0x4d/0x80
[  227.659097]  [<ffffffff81049b05>] irq_exit+0x95/0xa0
[  227.664056]  [<ffffffff8100411e>] do_IRQ+0x5e/0xd0
[  227.668847]  [<ffffffff81553eaa>] common_interrupt+0x6a/0x6a
[  227.674503]  <EOI>
[  227.676429]  [<ffffffff8141de86>] ? cpuidle_enter_state+0x56/0xd0
[  227.682733]  [<ffffffff8141de82>] ? cpuidle_enter_state+0x52/0xd0
[  227.688819]  [<ffffffff8141dfb6>] cpuidle_idle_call+0xb6/0x200
[  227.694649]  [<ffffffff8100aa39>] arch_cpu_idle+0x9/0x20
[  227.699963]  [<ffffffff81087cc0>] cpu_startup_entry+0x80/0x200
[  227.705799]  [<ffffffff815358a2>] rest_init+0x72/0x80
[  227.710854]  [<ffffffff81ac4e28>] start_kernel+0x3b2/0x3bf
[  227.716341]  [<ffffffff81ac4875>] ? repair_env_string+0x5e/0x5e
[  227.722261]  [<ffffffff81ac45a5>] x86_64_start_reservations+0x2a/0x2c
[  227.728696]  [<ffffffff81ac4675>] x86_64_start_kernel+0xce/0xd2
[  227.734607] Code: 88 ff ff 07 a4 34 a0 ff ff ff ff 00 10 91 7b 10 88
ff ff 26 00 26 00 10 00 00 00 00 01 00 00 00 00 00 00 20 00 00 00 00 00
00 00 <00> 88 58 4e 08 88 ff ff 78 92 58 4e 08 88 ff ff 01 00 00 00 00
[  227.754568] RIP  [<ffff880848ed1758>] 0xffff880848ed1757
[  227.759898]  RSP <ffff88087fc03e88>
[  227.763386] CR2: ffff880848ed1758
[  227.766699] ---[ end trace 264f2f21a7499134 ]---
[  227.772473] Kernel panic - not syncing: Fatal exception in interrupt
[  227.779954] ------------[ cut here ]------------
[  227.784570] WARNING: at arch/x86/kernel/smp.c:123
native_smp_send_reschedule+0x56/0x60()
[  227.792657] Modules linked in: auth_rpcgss oid_registry nfsv4
xprtrdma cpuid af_packet 8021q garp stp llc rdma_ucm ib_ucm rdma_cm
iw_cm ib_addr ib_ipoib ib_cm ib_uverbs ib_umad mlx4_en mlx4_ib ib_sa
ib_mad ib_core joydev usbhid mlx4_core iTCO_wdt iTCO_vendor_support
acpi_cpufreq coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel
aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64
microcode pcspkr sb_edac edac_core i2c_i801 isci sg libsas ehci_pci
ehci_hcd scsi_transport_sas usbcore lpc_ich ioatdma mfd_core usb_common
shpchp pci_hotplug wmi mperf processor thermal_sys button edd autofs4
xfs libcrc32c nfsv3 nfs fscache lockd nfs_acl sunrpc igb dca
i2c_algo_bit ptp pps_core
[  227.855155] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G      D
3.10.17-allpatches+ #1
[  227.863147] Hardware name: Supermicro B9DRG-E/B9DRG-E, BIOS 3.0
09/04/2013
[  227.870014]  0000000000000009 ffff88087fc03850 ffffffff8154de8a
ffff88087fc03890
[  227.877471]  ffffffff81041d3b ffff88105d9bff30 0000000000000001
0000000000000001
[  227.884927]  00000000ffffb8fb ffff88087fc33380 0000000000000000
ffff88087fc038a0
[  227.892385] Call Trace:
[  227.894830]  <IRQ>  [<ffffffff8154de8a>] dump_stack+0x19/0x1b
[  227.900597]  [<ffffffff81041d3b>] warn_slowpath_common+0x6b/0xa0
[  227.906596]  [<ffffffff81041d85>] warn_slowpath_null+0x15/0x20
[  227.912422]  [<ffffffff81024046>] native_smp_send_reschedule+0x56/0x60
[  227.918944]  [<ffffffff8107f084>] trigger_load_balance+0x174/0x210
[  227.925127]  [<ffffffff81073715>] scheduler_tick+0x105/0x140
[  227.930789]  [<ffffffff81051e62>] update_process_times+0x62/0x80
[  227.936796]  [<ffffffff81091219>] tick_sched_handle.isra.10+0x29/0x70
[  227.943238]  [<ffffffff810913a7>] tick_sched_timer+0x47/0x70
[  227.948891]  [<ffffffff81068bea>] __run_hrtimer+0x6a/0x1b0
[  227.954371]  [<ffffffff81091360>] ? tick_sched_do_timer+0x30/0x30
[  227.960467]  [<ffffffff810694df>] hrtimer_interrupt+0xff/0x230
[  227.966303]  [<ffffffff810250a4>] smp_apic_timer_interrupt+0x64/0xa0
[  227.972658]  [<ffffffff8155c53a>] apic_timer_interrupt+0x6a/0x70
[  227.978665]  [<ffffffff8154980f>] ? panic+0x181/0x1bf
[  227.983712]  [<ffffffff81549779>] ? panic+0xeb/0x1bf
[  227.988681]  [<ffffffff81554bca>] oops_end+0xda/0xe0
[  227.993647]  [<ffffffff815490f9>] no_context+0x24e/0x279
[  227.998955]  [<ffffffff815492d9>] __bad_area_nosemaphore+0x1b5/0x1d4
[  228.005310]  [<ffffffff8104394f>] ? wake_up_klogd+0x2f/0x40
[  228.010876]  [<ffffffff81549306>] bad_area_nosemaphore+0xe/0x10
[  228.016791]  [<ffffffff81557406>] __do_page_fault+0x396/0x4e0
[  228.022548]  [<ffffffffa02939d1>] ? mlx4_ib_poll_cq+0x91/0xe0 [mlx4_ib]
[  228.029164]  [<ffffffff81557559>] do_page_fault+0x9/0x10
[  228.034477]  [<ffffffff815540b2>] page_fault+0x22/0x30
[  228.039611]  [<ffffffffa034edde>] ? rpcrdma_run_tasklet+0x7e/0xc0
[xprtrdma]
[  228.046659]  [<ffffffff81049c82>] tasklet_action+0x52/0xc0
[  228.052140]  [<ffffffff81049870>] __do_softirq+0xe0/0x220
[  228.057542]  [<ffffffff8155cbac>] call_softirq+0x1c/0x30
[  228.062856]  [<ffffffff8100452d>] do_softirq+0x4d/0x80
[  228.067989]  [<ffffffff81049b05>] irq_exit+0x95/0xa0
[  228.072958]  [<ffffffff8100411e>] do_IRQ+0x5e/0xd0
[  228.077752]  [<ffffffff81553eaa>] common_interrupt+0x6a/0x6a
[  228.083411]  <EOI>  [<ffffffff8141de86>] ? cpuidle_enter_state+0x56/0xd0
[  228.090133]  [<ffffffff8141de82>] ? cpuidle_enter_state+0x52/0xd0
[  228.096220]  [<ffffffff8141dfb6>] cpuidle_idle_call+0xb6/0x200
[  228.102055]  [<ffffffff8100aa39>] arch_cpu_idle+0x9/0x20
[  228.107362]  [<ffffffff81087cc0>] cpu_startup_entry+0x80/0x200
[  228.113197]  [<ffffffff815358a2>] rest_init+0x72/0x80
[  228.118250]  [<ffffffff81ac4e28>] start_kernel+0x3b2/0x3bf
[  228.123731]  [<ffffffff81ac4875>] ? repair_env_string+0x5e/0x5e
[  228.129644]  [<ffffffff81ac45a5>] x86_64_start_reservations+0x2a/0x2c
[  228.136078]  [<ffffffff81ac4675>] x86_64_start_kernel+0xce/0xd2
[  228.141998] ---[ end trace 264f2f21a7499135 ]---
[  228.146625] RPC:       rpcrdma_event_process: event rep
ffff88084e588810 status 5 opcode 0 length 4294936584
[  228.156443] RPC:       rpcrdma_event_process: WC opcode 0 status 5,
connection lost
[  228.164091] RPC:       rpcrdma_event_process: event rep
(null) status 5 opcode 0 length 4294936584
[  228.173906] RPC:       rpcrdma_event_process: event rep
ffff880848ed0000 status 5 opcode 0 length 4294936584
[  228.183721] RPC:       rpcrdma_event_process: WC opcode 0 status 5,
connection lost

>>
>>>
>>> Do you (or does anyone) know whether OpenSuSE uses the stock upstream providers, or has any of OFED been applied to this kernel?
>>
>> My first post was misleading -- I use the vanilla 3.10.17 kernel, and the error occurs with and without Mellanox OFED. The above report
>> references the vanilla kernel.
>>
>> Rafael
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

Regards,
Klemens


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chuck Lever III March 28, 2014, 10:30 p.m. UTC | #5
On Mar 28, 2014, at 2:42 AM, Senn Klemens <klemens.senn@ims.co.at> wrote:

> Hi Chuck,
> 
> On 03/27/2014 04:59 PM, Chuck Lever wrote:
>> Hi-
>> 
>> 
>> On Mar 27, 2014, at 12:53 AM, Reiter Rafael <rafael.reiter@ims.co.at> wrote:
>> 
>>> On 03/26/2014 07:15 PM, Chuck Lever wrote:
>>>> 
>>>> Hi Rafael-
>>>> 
>>>> I’ll take a look. Can you report your HCA and how you reproduce this issue?
>>> 
>>> The HCA is Mellanox Technologies MT26428.
>>> 
>>> Reproduction:
>>> 1) Mount a directory via NFS/RDMA
>>> mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/
> 
> An additional "ls /mnt" is needed here (between step 1 and 2)
> 
>>> 2) Pull the Infiniband cable or use ibportstate to disrupt the Infiniband connection
>>> 3) ls /mnt
>>> 4) wait 5-30 seconds
>> 
>> Thanks for the information.
>> 
>> I have that HCA, but I won’t have access to my test systems for a week (traveling). So can you try this:
>> 
>> # rpcdebug -m rpc -s trans
>> 
>> then reproduce (starting with step 1 above). Some debugging output will appear at the tail of /var/log/messages. Copy it to this thread.
>> 
> 
> The output of /var/log/messages is:
> 
> [  143.233701] RPC:  1688 xprt_rdma_allocate: size 1112 too large for
> buffer[1024]: prog 100003 vers 4 proc 1
> [  143.233708] RPC:  1688 xprt_rdma_allocate: size 1112, request
> 0xffff88105894c000
> [  143.233715] RPC:  1688 rpcrdma_inline_pullup: pad 0 destp
> 0xffff88105894d7dc len 124 hdrlen 124
> [  143.233718] RPC:       rpcrdma_register_frmr_external: Using frmr
> ffff88084e589260 to map 1 segments
> [  143.233722] RPC:  1688 rpcrdma_create_chunks: reply chunk elem
> 652@0x105894d92c:0xced01 (last)
> [  143.233725] RPC:  1688 rpcrdma_marshal_req: reply chunk: hdrlen 48
> rpclen 124 padlen 0 headerp 0xffff88105894d100 base 0xffff88105894d760
> lkey 0x8000
> [  143.233785] RPC:       rpcrdma_event_process: event rep
> ffff88084e589260 status 0 opcode 8 length 0
> [  177.272397] RPC:       rpcrdma_event_process: event rep
> (null) status C opcode FFFF8808 length 4294967295
> [  177.272649] RPC:       rpcrdma_event_process: event rep
> ffff880848ed0000 status 5 opcode FFFF8808 length 4294936584

The mlx4 provider is returning a WC completion status of
IB_WC_WR_FLUSH_ERR.

> [  177.272651] RPC:       rpcrdma_event_process: WC opcode -30712 status
> 5, connection lost

-30712 is a bogus WC opcode. So the mlx4 provider is not filling in the
WC opcode. rpcrdma_event_process() thus can’t depend on the contents of
the ib_wc.opcode field when the WC completion status != IB_WC_SUCCESS.

A copy of the opcode reachable from the incoming rpcrdma_rep could be
added, initialized in the forward paths. rpcrdma_event_process() could
use the copy in the error case.


> [  177.984996] RPC:  1689 xprt_rdma_allocate: size 436, request
> 0xffff880848f00000
> [  182.290655] RPC:       xprt_rdma_connect_worker: reconnect
> [  182.290992] RPC:       rpcrdma_ep_disconnect: after wait, disconnected
> [  187.300726] RPC:       xprt_rdma_connect_worker: exit
> [  197.320527] RPC:       xprt_rdma_connect_worker: reconnect
> [  197.320795] RPC:       rpcrdma_ep_disconnect: after wait, disconnected
> [  202.330477] RPC:       xprt_rdma_connect_worker: exit
> [  222.354286] RPC:       xprt_rdma_connect_worker: reconnect
> [  222.354624] RPC:       rpcrdma_ep_disconnect: after wait, disconnected
> 
> 
> The output on the serial terminal is:
> 
> [  227.364376] kernel tried to execute NX-protected page - exploit
> attempt? (uid: 0)
> [  227.364517] RPC:  1689 rpcrdma_inline_pullup: pad 0 destp
> 0xffff880848f017c4 len 100 hdrlen 100
> [  227.364519] RPC:       rpcrdma_register_frmr_external: Using frmr
> ffff88084e588810 to map 1 segments
> [  227.364522] RPC:  1689 rpcrdma_create_chunks: reply chunk elem
> 152@0x848f0187c:0xcab01 (last)
> [  227.364523] RPC:  1689 rpcrdma_marshal_req: reply chunk: hdrlen 48
> rpclen 100 padlen 0 headerp 0xffff880848f01100 base 0xffff880848f01760
> lkey 0x8000
> [  227.411547] BUG: unable to handle kernel paging request at
> ffff880848ed1758
> [  227.418535] IP: [<ffff880848ed1758>] 0xffff880848ed1757
> [  227.423781] PGD 1d7c067 PUD 85d52f063 PMD 848f12063 PTE 8000000848ed1163
> [  227.430544] Oops: 0011 [#1] SMP
> [  227.433802] Modules linked in: auth_rpcgss oid_registry nfsv4
> xprtrdma cpuid af_packet 8021q garp stp llc rdma_ucm ib_ucm rdma_cm
> iw_cm ib_addr ib_ipoib ib_cm ib_uverbs ib_umad mlx4_en mlx4_ib ib_sa
> ib_mad ib_core joydev usbhid mlx4_core iTCO_wdt iTCO_vendor_support
> acpi_cpufreq coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel
> aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64
> microcode pcspkr sb_edac edac_core i2c_i801 isci sg libsas ehci_pci
> ehci_hcd scsi_transport_sas usbcore lpc_ich ioatdma mfd_core usb_common
> shpchp pci_hotplug wmi mperf processor thermal_sys button edd autofs4
> xfs libcrc32c nfsv3 nfs fscache lockd nfs_acl sunrpc igb dca
> i2c_algo_bit ptp pps_core
> [  227.496536] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
> 3.10.17-allpatches+ #1
> [  227.503583] Hardware name: Supermicro B9DRG-E/B9DRG-E, BIOS 3.0
> 09/04/2013
> [  227.510451] task: ffffffff81a11440 ti: ffffffff81a00000 task.ti:
> ffffffff81a00000
> [  227.517924] RIP: 0010:[<ffff880848ed1758>]  [<ffff880848ed1758>]
> 0xffff880848ed1757
> [  227.525597] RSP: 0018:ffff88087fc03e88  EFLAGS: 00010282
> [  227.530903] RAX: 0000000000000286 RBX: ffff880848ed1758 RCX:
> ffffffffa0354360
> [  227.538032] RDX: ffff88084e589280 RSI: 0000000000000286 RDI:
> ffff88084e589260
> [  227.545157] RBP: ffff88087fc03ea0 R08: ffffffffa0354360 R09:
> 00000000000005f0
> [  227.552286] R10: 0000000000000003 R11: dead000000100100 R12:
> ffff88084e589260
> [  227.559412] R13: 0000000000000006 R14: 0000000000000006 R15:
> ffffffff81a5db90
> [  227.566540] FS:  0000000000000000(0000) GS:ffff88087fc00000(0000)
> knlGS:0000000000000000
> [  227.574620] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  227.580360] CR2: ffff880848ed1758 CR3: 0000000001a0c000 CR4:
> 00000000000407f0
> [  227.587487] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [  227.594614] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [  227.601742] Stack:
> [  227.603754]  ffffffffa034edde ffffffffa0354388 0000000000000000
> ffff88087fc03ec0
> [  227.611210]  ffffffff81049c82 0000000000000001 ffffffff81a050b0
> ffff88087fc03f30
> [  227.618666]  ffffffff81049870 ffffffff81a01fd8 00000000ffffb895
> 0000000000202000
> [  227.626123] Call Trace:
> [  227.628569]  <IRQ>
> [  227.630493]  [<ffffffffa034edde>] ? rpcrdma_run_tasklet+0x7e/0xc0
> [xprtrdma]
> [  227.637764]  [<ffffffff81049c82>] tasklet_action+0x52/0xc0
> [  227.643249]  [<ffffffff81049870>] __do_softirq+0xe0/0x220
> [  227.648643]  [<ffffffff8155cbac>] call_softirq+0x1c/0x30
> [  227.653957]  [<ffffffff8100452d>] do_softirq+0x4d/0x80
> [  227.659097]  [<ffffffff81049b05>] irq_exit+0x95/0xa0
> [  227.664056]  [<ffffffff8100411e>] do_IRQ+0x5e/0xd0
> [  227.668847]  [<ffffffff81553eaa>] common_interrupt+0x6a/0x6a
> [  227.674503]  <EOI>
> [  227.676429]  [<ffffffff8141de86>] ? cpuidle_enter_state+0x56/0xd0
> [  227.682733]  [<ffffffff8141de82>] ? cpuidle_enter_state+0x52/0xd0
> [  227.688819]  [<ffffffff8141dfb6>] cpuidle_idle_call+0xb6/0x200
> [  227.694649]  [<ffffffff8100aa39>] arch_cpu_idle+0x9/0x20
> [  227.699963]  [<ffffffff81087cc0>] cpu_startup_entry+0x80/0x200
> [  227.705799]  [<ffffffff815358a2>] rest_init+0x72/0x80
> [  227.710854]  [<ffffffff81ac4e28>] start_kernel+0x3b2/0x3bf
> [  227.716341]  [<ffffffff81ac4875>] ? repair_env_string+0x5e/0x5e
> [  227.722261]  [<ffffffff81ac45a5>] x86_64_start_reservations+0x2a/0x2c
> [  227.728696]  [<ffffffff81ac4675>] x86_64_start_kernel+0xce/0xd2
> [  227.734607] Code: 88 ff ff 07 a4 34 a0 ff ff ff ff 00 10 91 7b 10 88
> ff ff 26 00 26 00 10 00 00 00 00 01 00 00 00 00 00 00 20 00 00 00 00 00
> 00 00 <00> 88 58 4e 08 88 ff ff 78 92 58 4e 08 88 ff ff 01 00 00 00 00
> [  227.754568] RIP  [<ffff880848ed1758>] 0xffff880848ed1757
> [  227.759898]  RSP <ffff88087fc03e88>
> [  227.763386] CR2: ffff880848ed1758
> [  227.766699] ---[ end trace 264f2f21a7499134 ]---
> [  227.772473] Kernel panic - not syncing: Fatal exception in interrupt
> [  227.779954] ------------[ cut here ]------------
> [  227.784570] WARNING: at arch/x86/kernel/smp.c:123
> native_smp_send_reschedule+0x56/0x60()
> [  227.792657] Modules linked in: auth_rpcgss oid_registry nfsv4
> xprtrdma cpuid af_packet 8021q garp stp llc rdma_ucm ib_ucm rdma_cm
> iw_cm ib_addr ib_ipoib ib_cm ib_uverbs ib_umad mlx4_en mlx4_ib ib_sa
> ib_mad ib_core joydev usbhid mlx4_core iTCO_wdt iTCO_vendor_support
> acpi_cpufreq coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel
> aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64
> microcode pcspkr sb_edac edac_core i2c_i801 isci sg libsas ehci_pci
> ehci_hcd scsi_transport_sas usbcore lpc_ich ioatdma mfd_core usb_common
> shpchp pci_hotplug wmi mperf processor thermal_sys button edd autofs4
> xfs libcrc32c nfsv3 nfs fscache lockd nfs_acl sunrpc igb dca
> i2c_algo_bit ptp pps_core
> [  227.855155] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G      D
> 3.10.17-allpatches+ #1
> [  227.863147] Hardware name: Supermicro B9DRG-E/B9DRG-E, BIOS 3.0
> 09/04/2013
> [  227.870014]  0000000000000009 ffff88087fc03850 ffffffff8154de8a
> ffff88087fc03890
> [  227.877471]  ffffffff81041d3b ffff88105d9bff30 0000000000000001
> 0000000000000001
> [  227.884927]  00000000ffffb8fb ffff88087fc33380 0000000000000000
> ffff88087fc038a0
> [  227.892385] Call Trace:
> [  227.894830]  <IRQ>  [<ffffffff8154de8a>] dump_stack+0x19/0x1b
> [  227.900597]  [<ffffffff81041d3b>] warn_slowpath_common+0x6b/0xa0
> [  227.906596]  [<ffffffff81041d85>] warn_slowpath_null+0x15/0x20
> [  227.912422]  [<ffffffff81024046>] native_smp_send_reschedule+0x56/0x60
> [  227.918944]  [<ffffffff8107f084>] trigger_load_balance+0x174/0x210
> [  227.925127]  [<ffffffff81073715>] scheduler_tick+0x105/0x140
> [  227.930789]  [<ffffffff81051e62>] update_process_times+0x62/0x80
> [  227.936796]  [<ffffffff81091219>] tick_sched_handle.isra.10+0x29/0x70
> [  227.943238]  [<ffffffff810913a7>] tick_sched_timer+0x47/0x70
> [  227.948891]  [<ffffffff81068bea>] __run_hrtimer+0x6a/0x1b0
> [  227.954371]  [<ffffffff81091360>] ? tick_sched_do_timer+0x30/0x30
> [  227.960467]  [<ffffffff810694df>] hrtimer_interrupt+0xff/0x230
> [  227.966303]  [<ffffffff810250a4>] smp_apic_timer_interrupt+0x64/0xa0
> [  227.972658]  [<ffffffff8155c53a>] apic_timer_interrupt+0x6a/0x70
> [  227.978665]  [<ffffffff8154980f>] ? panic+0x181/0x1bf
> [  227.983712]  [<ffffffff81549779>] ? panic+0xeb/0x1bf
> [  227.988681]  [<ffffffff81554bca>] oops_end+0xda/0xe0
> [  227.993647]  [<ffffffff815490f9>] no_context+0x24e/0x279
> [  227.998955]  [<ffffffff815492d9>] __bad_area_nosemaphore+0x1b5/0x1d4
> [  228.005310]  [<ffffffff8104394f>] ? wake_up_klogd+0x2f/0x40
> [  228.010876]  [<ffffffff81549306>] bad_area_nosemaphore+0xe/0x10
> [  228.016791]  [<ffffffff81557406>] __do_page_fault+0x396/0x4e0
> [  228.022548]  [<ffffffffa02939d1>] ? mlx4_ib_poll_cq+0x91/0xe0 [mlx4_ib]
> [  228.029164]  [<ffffffff81557559>] do_page_fault+0x9/0x10
> [  228.034477]  [<ffffffff815540b2>] page_fault+0x22/0x30
> [  228.039611]  [<ffffffffa034edde>] ? rpcrdma_run_tasklet+0x7e/0xc0
> [xprtrdma]
> [  228.046659]  [<ffffffff81049c82>] tasklet_action+0x52/0xc0
> [  228.052140]  [<ffffffff81049870>] __do_softirq+0xe0/0x220
> [  228.057542]  [<ffffffff8155cbac>] call_softirq+0x1c/0x30
> [  228.062856]  [<ffffffff8100452d>] do_softirq+0x4d/0x80
> [  228.067989]  [<ffffffff81049b05>] irq_exit+0x95/0xa0
> [  228.072958]  [<ffffffff8100411e>] do_IRQ+0x5e/0xd0
> [  228.077752]  [<ffffffff81553eaa>] common_interrupt+0x6a/0x6a
> [  228.083411]  <EOI>  [<ffffffff8141de86>] ? cpuidle_enter_state+0x56/0xd0
> [  228.090133]  [<ffffffff8141de82>] ? cpuidle_enter_state+0x52/0xd0
> [  228.096220]  [<ffffffff8141dfb6>] cpuidle_idle_call+0xb6/0x200
> [  228.102055]  [<ffffffff8100aa39>] arch_cpu_idle+0x9/0x20
> [  228.107362]  [<ffffffff81087cc0>] cpu_startup_entry+0x80/0x200
> [  228.113197]  [<ffffffff815358a2>] rest_init+0x72/0x80
> [  228.118250]  [<ffffffff81ac4e28>] start_kernel+0x3b2/0x3bf
> [  228.123731]  [<ffffffff81ac4875>] ? repair_env_string+0x5e/0x5e
> [  228.129644]  [<ffffffff81ac45a5>] x86_64_start_reservations+0x2a/0x2c
> [  228.136078]  [<ffffffff81ac4675>] x86_64_start_kernel+0xce/0xd2
> [  228.141998] ---[ end trace 264f2f21a7499135 ]---
> [  228.146625] RPC:       rpcrdma_event_process: event rep
> ffff88084e588810 status 5 opcode 0 length 4294936584
> [  228.156443] RPC:       rpcrdma_event_process: WC opcode 0 status 5,
> connection lost
> [  228.164091] RPC:       rpcrdma_event_process: event rep
> (null) status 5 opcode 0 length 4294936584
> [  228.173906] RPC:       rpcrdma_event_process: event rep
> ffff880848ed0000 status 5 opcode 0 length 4294936584
> [  228.183721] RPC:       rpcrdma_event_process: WC opcode 0 status 5,
> connection lost
> 
>>> 
>>>> 
>>>> Do you (or does anyone) know whether OpenSuSE uses the stock upstream providers, or has any of OFED been applied to this kernel?
>>> 
>>> My first post was misleading -- I use the vanilla 3.10.17 kernel, and the error occurs with and without Mellanox OFED. The above report
>>> references the vanilla kernel.
>>> 
>>> Rafael
>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> --
>> Chuck Lever
>> chuck[dot]lever[at]oracle[dot]com
>> 
>> 
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
> 
> Regards,
> Klemens
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg March 28, 2014, 11:06 p.m. UTC | #6
On 3/29/2014 1:30 AM, Chuck Lever wrote:
> On Mar 28, 2014, at 2:42 AM, Senn Klemens <klemens.senn@ims.co.at> wrote:
>
>> Hi Chuck,
>>
>> On 03/27/2014 04:59 PM, Chuck Lever wrote:
>>> Hi-
>>>
>>>
>>> On Mar 27, 2014, at 12:53 AM, Reiter Rafael <rafael.reiter@ims.co.at> wrote:
>>>
>>>> On 03/26/2014 07:15 PM, Chuck Lever wrote:
>>>>> Hi Rafael-
>>>>>
>>>>> I’ll take a look. Can you report your HCA and how you reproduce this issue?
>>>> The HCA is Mellanox Technologies MT26428.
>>>>
>>>> Reproduction:
>>>> 1) Mount a directory via NFS/RDMA
>>>> mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/
>> An additional "ls /mnt" is needed here (between step 1 and 2)
>>
>>>> 2) Pull the Infiniband cable or use ibportstate to disrupt the Infiniband connection
>>>> 3) ls /mnt
>>>> 4) wait 5-30 seconds
>>> Thanks for the information.
>>>
>>> I have that HCA, but I won’t have access to my test systems for a week (traveling). So can you try this:
>>>
>>> # rpcdebug -m rpc -s trans
>>>
>>> then reproduce (starting with step 1 above). Some debugging output will appear at the tail of /var/log/messages. Copy it to this thread.
>>>
>> The output of /var/log/messages is:
>>
>> [  143.233701] RPC:  1688 xprt_rdma_allocate: size 1112 too large for
>> buffer[1024]: prog 100003 vers 4 proc 1
>> [  143.233708] RPC:  1688 xprt_rdma_allocate: size 1112, request
>> 0xffff88105894c000
>> [  143.233715] RPC:  1688 rpcrdma_inline_pullup: pad 0 destp
>> 0xffff88105894d7dc len 124 hdrlen 124
>> [  143.233718] RPC:       rpcrdma_register_frmr_external: Using frmr
>> ffff88084e589260 to map 1 segments
>> [  143.233722] RPC:  1688 rpcrdma_create_chunks: reply chunk elem
>> 652@0x105894d92c:0xced01 (last)
>> [  143.233725] RPC:  1688 rpcrdma_marshal_req: reply chunk: hdrlen 48
>> rpclen 124 padlen 0 headerp 0xffff88105894d100 base 0xffff88105894d760
>> lkey 0x8000
>> [  143.233785] RPC:       rpcrdma_event_process: event rep
>> ffff88084e589260 status 0 opcode 8 length 0
>> [  177.272397] RPC:       rpcrdma_event_process: event rep
>> (null) status C opcode FFFF8808 length 4294967295
>> [  177.272649] RPC:       rpcrdma_event_process: event rep
>> ffff880848ed0000 status 5 opcode FFFF8808 length 4294936584
> The mlx4 provider is returning a WC completion status of
> IB_WC_WR_FLUSH_ERR.
>
>> [  177.272651] RPC:       rpcrdma_event_process: WC opcode -30712 status
>> 5, connection lost
> -30712 is a bogus WC opcode. So the mlx4 provider is not filling in the
> WC opcode. rpcrdma_event_process() thus can’t depend on the contents of
> the ib_wc.opcode field when the WC completion status != IB_WC_SUCCESS.

Hey Chuck,

That is correct, the opcode field in the wc is not reliable in FLUSH errors.

>
> A copy of the opcode reachable from the incoming rpcrdma_rep could be
> added, initialized in the forward paths. rpcrdma_event_process() could
> use the copy in the error case.

How about suppressing completions alltogether for fast_reg and local_inv 
work requests?
if these shall fail you will get an error completion and the QP will 
transition to error state
generating FLUSH_ERR completions for all pending WRs. In this case, you 
can just ignore
flush fast_reg + local_inv errors.

see http://marc.info/?l=linux-rdma&m=139047309831997&w=2

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chuck Lever III March 29, 2014, 12:05 a.m. UTC | #7
On Mar 28, 2014, at 4:06 PM, sagi grimberg <sagig@mellanox.com> wrote:

> On 3/29/2014 1:30 AM, Chuck Lever wrote:
>> On Mar 28, 2014, at 2:42 AM, Senn Klemens <klemens.senn@ims.co.at> wrote:
>> 
>>> Hi Chuck,
>>> 
>>> On 03/27/2014 04:59 PM, Chuck Lever wrote:
>>>> Hi-
>>>> 
>>>> 
>>>> On Mar 27, 2014, at 12:53 AM, Reiter Rafael <rafael.reiter@ims.co.at> wrote:
>>>> 
>>>>> On 03/26/2014 07:15 PM, Chuck Lever wrote:
>>>>>> Hi Rafael-
>>>>>> 
>>>>>> I’ll take a look. Can you report your HCA and how you reproduce this issue?
>>>>> The HCA is Mellanox Technologies MT26428.
>>>>> 
>>>>> Reproduction:
>>>>> 1) Mount a directory via NFS/RDMA
>>>>> mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/
>>> An additional "ls /mnt" is needed here (between step 1 and 2)
>>> 
>>>>> 2) Pull the Infiniband cable or use ibportstate to disrupt the Infiniband connection
>>>>> 3) ls /mnt
>>>>> 4) wait 5-30 seconds
>>>> Thanks for the information.
>>>> 
>>>> I have that HCA, but I won’t have access to my test systems for a week (traveling). So can you try this:
>>>> 
>>>> # rpcdebug -m rpc -s trans
>>>> 
>>>> then reproduce (starting with step 1 above). Some debugging output will appear at the tail of /var/log/messages. Copy it to this thread.
>>>> 
>>> The output of /var/log/messages is:
>>> 
>>> [  143.233701] RPC:  1688 xprt_rdma_allocate: size 1112 too large for
>>> buffer[1024]: prog 100003 vers 4 proc 1
>>> [  143.233708] RPC:  1688 xprt_rdma_allocate: size 1112, request
>>> 0xffff88105894c000
>>> [  143.233715] RPC:  1688 rpcrdma_inline_pullup: pad 0 destp
>>> 0xffff88105894d7dc len 124 hdrlen 124
>>> [  143.233718] RPC:       rpcrdma_register_frmr_external: Using frmr
>>> ffff88084e589260 to map 1 segments
>>> [  143.233722] RPC:  1688 rpcrdma_create_chunks: reply chunk elem
>>> 652@0x105894d92c:0xced01 (last)
>>> [  143.233725] RPC:  1688 rpcrdma_marshal_req: reply chunk: hdrlen 48
>>> rpclen 124 padlen 0 headerp 0xffff88105894d100 base 0xffff88105894d760
>>> lkey 0x8000
>>> [  143.233785] RPC:       rpcrdma_event_process: event rep
>>> ffff88084e589260 status 0 opcode 8 length 0
>>> [  177.272397] RPC:       rpcrdma_event_process: event rep
>>> (null) status C opcode FFFF8808 length 4294967295
>>> [  177.272649] RPC:       rpcrdma_event_process: event rep
>>> ffff880848ed0000 status 5 opcode FFFF8808 length 4294936584
>> The mlx4 provider is returning a WC completion status of
>> IB_WC_WR_FLUSH_ERR.
>> 
>>> [  177.272651] RPC:       rpcrdma_event_process: WC opcode -30712 status
>>> 5, connection lost
>> -30712 is a bogus WC opcode. So the mlx4 provider is not filling in the
>> WC opcode. rpcrdma_event_process() thus can’t depend on the contents of
>> the ib_wc.opcode field when the WC completion status != IB_WC_SUCCESS.
> 
> Hey Chuck,
> 
> That is correct, the opcode field in the wc is not reliable in FLUSH errors.
> 
>> 
>> A copy of the opcode reachable from the incoming rpcrdma_rep could be
>> added, initialized in the forward paths. rpcrdma_event_process() could
>> use the copy in the error case.
> 
> How about suppressing completions alltogether for fast_reg and local_inv work requests?
> if these shall fail you will get an error completion and the QP will transition to error state
> generating FLUSH_ERR completions for all pending WRs. In this case, you can just ignore
> flush fast_reg + local_inv errors.
> 
> see http://marc.info/?l=linux-rdma&m=139047309831997&w=2

While considering your suggestion, I see that my proposed fix doesn’t work. In the FAST_REG_MR and LOCAL_INV cases, wr_id points to a struct rpcrdma_mw, not a struct rpcrdma_rep. Putting a copy of the opcode in rpcrdma_rep would have no effect. Worse:

>  158         if (IB_WC_SUCCESS != wc->status) {
>  159                 dprintk("RPC:       %s: WC opcode %d status %X, connection lost\n",
>  160                         __func__, wc->opcode, wc->status);
>  161                 rep->rr_len = ~0U;

Suppose this is an IB_WC_FAST_REG_MR completion, so “rep” here is actually a struct rpcrdma_mw, not a struct rpcrdma_rep. Line 161 pokes 32 one-bits at the top of that struct rpcrdma_mw. If wc->opcode was always usable, we’d at least have to fix that.

>  162                 if (wc->opcode != IB_WC_FAST_REG_MR && wc->opcode != IB_WC_LOCAL_INV)
>  163                         rpcrdma_schedule_tasklet(rep);
>  164                 return;
>  165         }
>  166 
>  167         switch (wc->opcode) {
>  168         case IB_WC_FAST_REG_MR:
>  169                 frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
>  170                 frmr->r.frmr.state = FRMR_IS_VALID;
>  171                 break;


To make my initial solution work, you’d have to add a field to both struct rpcrdma_mw and struct rpcrdma_rep, and ensure they are at the same offset in both structures. Ewe.

Eliminating completions for FAST_REG_MR and LOCAL_INV might be a preferable way to address this.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg March 29, 2014, 12:52 a.m. UTC | #8
On 3/29/2014 3:05 AM, Chuck Lever wrote:
> On Mar 28, 2014, at 4:06 PM, sagi grimberg <sagig@mellanox.com> wrote:
>
>> On 3/29/2014 1:30 AM, Chuck Lever wrote:
>>> On Mar 28, 2014, at 2:42 AM, Senn Klemens <klemens.senn@ims.co.at> wrote:
>>>
>>>> Hi Chuck,
>>>>
>>>> On 03/27/2014 04:59 PM, Chuck Lever wrote:
>>>>> Hi-
>>>>>
>>>>>
>>>>> On Mar 27, 2014, at 12:53 AM, Reiter Rafael <rafael.reiter@ims.co.at> wrote:
>>>>>
>>>>>> On 03/26/2014 07:15 PM, Chuck Lever wrote:
>>>>>>> Hi Rafael-
>>>>>>>
>>>>>>> I’ll take a look. Can you report your HCA and how you reproduce this issue?
>>>>>> The HCA is Mellanox Technologies MT26428.
>>>>>>
>>>>>> Reproduction:
>>>>>> 1) Mount a directory via NFS/RDMA
>>>>>> mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/
>>>> An additional "ls /mnt" is needed here (between step 1 and 2)
>>>>
>>>>>> 2) Pull the Infiniband cable or use ibportstate to disrupt the Infiniband connection
>>>>>> 3) ls /mnt
>>>>>> 4) wait 5-30 seconds
>>>>> Thanks for the information.
>>>>>
>>>>> I have that HCA, but I won’t have access to my test systems for a week (traveling). So can you try this:
>>>>>
>>>>> # rpcdebug -m rpc -s trans
>>>>>
>>>>> then reproduce (starting with step 1 above). Some debugging output will appear at the tail of /var/log/messages. Copy it to this thread.
>>>>>
>>>> The output of /var/log/messages is:
>>>>
>>>> [  143.233701] RPC:  1688 xprt_rdma_allocate: size 1112 too large for
>>>> buffer[1024]: prog 100003 vers 4 proc 1
>>>> [  143.233708] RPC:  1688 xprt_rdma_allocate: size 1112, request
>>>> 0xffff88105894c000
>>>> [  143.233715] RPC:  1688 rpcrdma_inline_pullup: pad 0 destp
>>>> 0xffff88105894d7dc len 124 hdrlen 124
>>>> [  143.233718] RPC:       rpcrdma_register_frmr_external: Using frmr
>>>> ffff88084e589260 to map 1 segments
>>>> [  143.233722] RPC:  1688 rpcrdma_create_chunks: reply chunk elem
>>>> 652@0x105894d92c:0xced01 (last)
>>>> [  143.233725] RPC:  1688 rpcrdma_marshal_req: reply chunk: hdrlen 48
>>>> rpclen 124 padlen 0 headerp 0xffff88105894d100 base 0xffff88105894d760
>>>> lkey 0x8000
>>>> [  143.233785] RPC:       rpcrdma_event_process: event rep
>>>> ffff88084e589260 status 0 opcode 8 length 0
>>>> [  177.272397] RPC:       rpcrdma_event_process: event rep
>>>> (null) status C opcode FFFF8808 length 4294967295
>>>> [  177.272649] RPC:       rpcrdma_event_process: event rep
>>>> ffff880848ed0000 status 5 opcode FFFF8808 length 4294936584
>>> The mlx4 provider is returning a WC completion status of
>>> IB_WC_WR_FLUSH_ERR.
>>>
>>>> [  177.272651] RPC:       rpcrdma_event_process: WC opcode -30712 status
>>>> 5, connection lost
>>> -30712 is a bogus WC opcode. So the mlx4 provider is not filling in the
>>> WC opcode. rpcrdma_event_process() thus can’t depend on the contents of
>>> the ib_wc.opcode field when the WC completion status != IB_WC_SUCCESS.
>> Hey Chuck,
>>
>> That is correct, the opcode field in the wc is not reliable in FLUSH errors.
>>
>>> A copy of the opcode reachable from the incoming rpcrdma_rep could be
>>> added, initialized in the forward paths. rpcrdma_event_process() could
>>> use the copy in the error case.
>> How about suppressing completions alltogether for fast_reg and local_inv work requests?
>> if these shall fail you will get an error completion and the QP will transition to error state
>> generating FLUSH_ERR completions for all pending WRs. In this case, you can just ignore
>> flush fast_reg + local_inv errors.
>>
>> see http://marc.info/?l=linux-rdma&m=139047309831997&w=2
> While considering your suggestion, I see that my proposed fix doesn’t work. In the FAST_REG_MR and LOCAL_INV cases, wr_id points to a struct rpcrdma_mw, not a struct rpcrdma_rep. Putting a copy of the opcode in rpcrdma_rep would have no effect. Worse:
>
>>   158         if (IB_WC_SUCCESS != wc->status) {
>>   159                 dprintk("RPC:       %s: WC opcode %d status %X, connection lost\n",
>>   160                         __func__, wc->opcode, wc->status);
>>   161                 rep->rr_len = ~0U;
> Suppose this is an IB_WC_FAST_REG_MR completion, so “rep” here is actually a struct rpcrdma_mw, not a struct rpcrdma_rep. Line 161 pokes 32 one-bits at the top of that struct rpcrdma_mw. If wc->opcode was always usable, we’d at least have to fix that.

<nod>
So for error completions one needs to be careful dereferencing wr_id as 
the opcode might not reliable. it will be better to first identify that 
wr_id is indeed a pointer to rep.

>>   162                 if (wc->opcode != IB_WC_FAST_REG_MR && wc->opcode != IB_WC_LOCAL_INV)
>>   163                         rpcrdma_schedule_tasklet(rep);
>>   164                 return;
>>   165         }
>>   166
>>   167         switch (wc->opcode) {
>>   168         case IB_WC_FAST_REG_MR:
>>   169                 frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
>>   170                 frmr->r.frmr.state = FRMR_IS_VALID;
>>   171                 break;
>
> To make my initial solution work, you’d have to add a field to both struct rpcrdma_mw and struct rpcrdma_rep, and ensure they are at the same offset in both structures. Ewe.
>
> Eliminating completions for FAST_REG_MR and LOCAL_INV might be a preferable way to address this.

Agree, Same applies for MW.

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chuck Lever III April 4, 2014, 3:20 p.m. UTC | #9
On Mar 28, 2014, at 8:52 PM, sagi grimberg <sagig@mellanox.com> wrote:

> On 3/29/2014 3:05 AM, Chuck Lever wrote:
>> On Mar 28, 2014, at 4:06 PM, sagi grimberg <sagig@mellanox.com> wrote:
>> 
>>> On 3/29/2014 1:30 AM, Chuck Lever wrote:
>>>> On Mar 28, 2014, at 2:42 AM, Senn Klemens <klemens.senn@ims.co.at> wrote:
>>>> 
>>>>> Hi Chuck,
>>>>> 
>>>>> On 03/27/2014 04:59 PM, Chuck Lever wrote:
>>>>>> Hi-
>>>>>> 
>>>>>> 
>>>>>> On Mar 27, 2014, at 12:53 AM, Reiter Rafael <rafael.reiter@ims.co.at> wrote:
>>>>>> 
>>>>>>> On 03/26/2014 07:15 PM, Chuck Lever wrote:
>>>>>>>> Hi Rafael-
>>>>>>>> 
>>>>>>>> I’ll take a look. Can you report your HCA and how you reproduce this issue?
>>>>>>> The HCA is Mellanox Technologies MT26428.
>>>>>>> 
>>>>>>> Reproduction:
>>>>>>> 1) Mount a directory via NFS/RDMA
>>>>>>> mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/
>>>>> An additional "ls /mnt" is needed here (between step 1 and 2)
>>>>> 
>>>>>>> 2) Pull the Infiniband cable or use ibportstate to disrupt the Infiniband connection
>>>>>>> 3) ls /mnt
>>>>>>> 4) wait 5-30 seconds
>>>>>> Thanks for the information.
>>>>>> 
>>>>>> I have that HCA, but I won’t have access to my test systems for a week (traveling). So can you try this:
>>>>>> 
>>>>>> # rpcdebug -m rpc -s trans
>>>>>> 
>>>>>> then reproduce (starting with step 1 above). Some debugging output will appear at the tail of /var/log/messages. Copy it to this thread.
>>>>>> 
>>>>> The output of /var/log/messages is:
>>>>> 
>>>>> [  143.233701] RPC:  1688 xprt_rdma_allocate: size 1112 too large for
>>>>> buffer[1024]: prog 100003 vers 4 proc 1
>>>>> [  143.233708] RPC:  1688 xprt_rdma_allocate: size 1112, request
>>>>> 0xffff88105894c000
>>>>> [  143.233715] RPC:  1688 rpcrdma_inline_pullup: pad 0 destp
>>>>> 0xffff88105894d7dc len 124 hdrlen 124
>>>>> [  143.233718] RPC:       rpcrdma_register_frmr_external: Using frmr
>>>>> ffff88084e589260 to map 1 segments
>>>>> [  143.233722] RPC:  1688 rpcrdma_create_chunks: reply chunk elem
>>>>> 652@0x105894d92c:0xced01 (last)
>>>>> [  143.233725] RPC:  1688 rpcrdma_marshal_req: reply chunk: hdrlen 48
>>>>> rpclen 124 padlen 0 headerp 0xffff88105894d100 base 0xffff88105894d760
>>>>> lkey 0x8000
>>>>> [  143.233785] RPC:       rpcrdma_event_process: event rep
>>>>> ffff88084e589260 status 0 opcode 8 length 0
>>>>> [  177.272397] RPC:       rpcrdma_event_process: event rep
>>>>> (null) status C opcode FFFF8808 length 4294967295
>>>>> [  177.272649] RPC:       rpcrdma_event_process: event rep
>>>>> ffff880848ed0000 status 5 opcode FFFF8808 length 4294936584
>>>> The mlx4 provider is returning a WC completion status of
>>>> IB_WC_WR_FLUSH_ERR.
>>>> 
>>>>> [  177.272651] RPC:       rpcrdma_event_process: WC opcode -30712 status
>>>>> 5, connection lost
>>>> -30712 is a bogus WC opcode. So the mlx4 provider is not filling in the
>>>> WC opcode. rpcrdma_event_process() thus can’t depend on the contents of
>>>> the ib_wc.opcode field when the WC completion status != IB_WC_SUCCESS.
>>> Hey Chuck,
>>> 
>>> That is correct, the opcode field in the wc is not reliable in FLUSH errors.
>>> 
>>>> A copy of the opcode reachable from the incoming rpcrdma_rep could be
>>>> added, initialized in the forward paths. rpcrdma_event_process() could
>>>> use the copy in the error case.
>>> How about suppressing completions alltogether for fast_reg and local_inv work requests?
>>> if these shall fail you will get an error completion and the QP will transition to error state
>>> generating FLUSH_ERR completions for all pending WRs. In this case, you can just ignore
>>> flush fast_reg + local_inv errors.
>>> 
>>> see http://marc.info/?l=linux-rdma&m=139047309831997&w=2
>> While considering your suggestion, I see that my proposed fix doesn’t work. In the FAST_REG_MR and LOCAL_INV cases, wr_id points to a struct rpcrdma_mw, not a struct rpcrdma_rep. Putting a copy of the opcode in rpcrdma_rep would have no effect. Worse:
>> 
>>>  158         if (IB_WC_SUCCESS != wc->status) {
>>>  159                 dprintk("RPC:       %s: WC opcode %d status %X, connection lost\n",
>>>  160                         __func__, wc->opcode, wc->status);
>>>  161                 rep->rr_len = ~0U;
>> Suppose this is an IB_WC_FAST_REG_MR completion, so “rep” here is actually a struct rpcrdma_mw, not a struct rpcrdma_rep. Line 161 pokes 32 one-bits at the top of that struct rpcrdma_mw. If wc->opcode was always usable, we’d at least have to fix that.
> 
> <nod>
> So for error completions one needs to be careful dereferencing wr_id as the opcode might not reliable. it will be better to first identify that wr_id is indeed a pointer to rep.
> 
>>>  162                 if (wc->opcode != IB_WC_FAST_REG_MR && wc->opcode != IB_WC_LOCAL_INV)
>>>  163                         rpcrdma_schedule_tasklet(rep);
>>>  164                 return;
>>>  165         }
>>>  166
>>>  167         switch (wc->opcode) {
>>>  168         case IB_WC_FAST_REG_MR:
>>>  169                 frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
>>>  170                 frmr->r.frmr.state = FRMR_IS_VALID;
>>>  171                 break;
>> 
>> To make my initial solution work, you’d have to add a field to both struct rpcrdma_mw and struct rpcrdma_rep, and ensure they are at the same offset in both structures. Ewe.
>> 
>> Eliminating completions for FAST_REG_MR and LOCAL_INV might be a preferable way to address this.
> 
> Agree, Same applies for MW.

Tracking this issue here:

  https://bugzilla.kernel.org/show_bug.cgi?id=73211
diff mbox

Patch

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 93726560..9bf8882 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -159,7 +159,7 @@  void rpcrdma_event_process(struct ib_wc *wc)
                dprintk("RPC:       %s: WC opcode %d status %X, connection lost\n",
                        __func__, wc->opcode, wc->status);
                rep->rr_len = ~0U;
-               if (wc->opcode != IB_WC_FAST_REG_MR && wc->opcode != IB_WC_LOCAL_INV)
+               if (wc->opcode == IB_WC_RECV || wc->opcode == IB_WC_BIND_MW)
                        rpcrdma_schedule_tasklet(rep);
                return;
        }