Message ID | 20110606094547.0c04d1c5@tlielax.poochiereds.net (mailing list archive)
---|---
State | New, archived
Sounds promising. Does anyone else have thoughts about Jeff's proposed solution? Ben, if you get test data on this with and without the patch, let us know.

On Mon, Jun 6, 2011 at 8:45 AM, Jeff Layton <jlayton@redhat.com> wrote:
> On Sat, 4 Jun 2011 07:19:23 -0400
> Jeff Layton <jlayton@redhat.com> wrote:
>
>> On Fri, 03 Jun 2011 22:03:43 -0700
>> Ben Greear <greearb@candelatech.com> wrote:
>>
>> > On 06/03/2011 06:42 PM, Jeff Layton wrote:
>> > > On Fri, 03 Jun 2011 14:01:11 -0700
>> > > Ben Greear <greearb@candelatech.com> wrote:
>> > >
>> > >> Ok, we had some luck. Here's the backtrace and attending dmesg
>> > >> output. The filer has been doing failover, but it has not gone
>> > >> into a failed state...so, the system *should* be able to reconnect.
>> > >>
>> > >> We have the system in the failed state now and will leave it that way
>> > >> for a bit in case you have some commands you'd like me to run.
>> > >>
>> > >> Aside from the hung cifs processes (anything accessing those mounts
>> > >> gets into the D state), the system seems fine.
>> > >> >> > >> >> > >> CIFS VFS: Unexpected lookup error -112 >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Unexpected lookup error -11 >> > >> CIFS VFS: Unexpected lookup error -112 >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Unexpected lookup error -112 >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Unexpected lookup error -11 >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: Reconnecting tcp session >> > >> CIFS VFS: need to reconnect in sendv here >> > >> ------------[ cut here ]------------ >> > >> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]() >> > >> BUG: unable to handle kernel >> > >> Hardware name: X8ST3 >> > >> NULL pointer dereference >> > >> Modules linked in: at 0000000000000020 >> > >> be2iscsi >> > >> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e >> > >> bnx2iPGD 0 cnic >> > >> uio >> > >> Oops: 0000 [#1] cxgb3iPREEMPT libcxgbiSMP cxgb3 >> > >> mdio >> > >> last sysfs file: /sys/devices/platform/host10/session7/target10:0:0/10:0:0:0/block/sde/sde1/stat >> > >> 
ib_iserCPU 2 rdma_cm >> > >> Modules linked in: ib_cm be2iscsi iw_cm iscsi_boot_sysfs ib_sa bnx2i ib_mad cnic ib_core uio ib_addr cxgb3i md4 libcxgbi nls_utf8 cxgb3 cifs mdio xt_TPROXY >> > >> ib_iser rdma_cm nf_tproxy_core ib_cm xt_socket iw_cm ib_sa ip6_tables ib_mad ib_core nf_defrag_ipv6 ib_addr md4 nls_utf8 xt_connlimit cifs xt_TPROXY >> > >> nf_tproxy_core xt_socket ip6_tables nf_defrag_ipv6 xt_connlimit 8021q garp bridge stp llc fuse macvlan wanlink(P) pktgen iscsi_tcp libiscsi_tcp libiscsi >> > >> scsi_transport_iscsi nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput i2c_i801 e1000e 8021q i2c_core garp igb bridge ioatdma stp iTCO_wdt llc >> > >> i7core_edac fuse iTCO_vendor_support macvlan pcspkr wanlink(P) dca pktgen edac_core iscsi_tcp microcode [last unloaded: ipt_addrtype] libiscsi_tcp >> > >> libiscsi >> > >> scsi_transport_iscsi >> > >> Pid: 5047, comm: cifsd Tainted: P 2.6.38.8+ #12 nfs lockdSupermicro X8ST3 fscache/X8ST3 nfs_acl >> > >> auth_rpcgss >> > >> RIP: 0010:[<ffffffff81356230>] [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e >> > >> sunrpc >> > >> RSP: 0018:ffff8802e64e5bc0 EFLAGS: 00010286 >> > >> ipv6 >> > >> RAX: 0000000000000000 RBX: ffff8802e64e5c40 RCX: 0000000000000004 >> > >> uinput >> > >> RDX: ffff8802e64e5e40 RSI: 0000000000000000 RDI: ffff8802e64e5c40 >> > >> i2c_i801 >> > >> RBP: ffff8802e64e5bf0 R08: 0000000000000000 R09: 0000000000000000 >> > >> R10: ffff8802e64e5d80 R11: ffff8802e64e5e40 R12: ffff8802e64e5d10 >> > >> e1000e >> > >> R13: 0000000000000000 R14: ffff8802e64e5c40 R15: ffff8802e6429f80 >> > >> i2c_core >> > >> FS: 0000000000000000(0000) GS:ffff8800df440000(0000) knlGS:0000000000000000 >> > >> igb >> > >> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b >> > >> ioatdma >> > >> CR2: 0000000000000020 CR3: 0000000001803000 CR4: 00000000000006e0 >> > >> iTCO_wdt >> > >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> > >> i7core_edac >> > >> DR3: 0000000000000000 DR6: 00000000ffff0ff0 
DR7: 0000000000000400 >> > >> iTCO_vendor_supportProcess cifsd (pid: 5047, threadinfo ffff8802e64e4000, task ffff88030482d880) >> > >> pcspkr >> > >> Stack: >> > >> dca ffff8802e64e5c10 edac_core ffffffff81039b72 microcode ffff880200000001 [last unloaded: ipt_addrtype] ffff880305a40fc8 >> > >> >> > >> ffff8802e64e5e40Pid: 4754, comm: btserver Tainted: P 2.6.38.8+ #12 >> > >> 0000000000000004Call Trace: >> > >> ffff8802e64e5c30 ffffffff8135792c >> > >> 0000000000000000 0000000000000000 [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98 >> > >> ffff8802e64e5c40 ffffffffffffffff [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17 >> > >> >> > >> Call Trace: >> > >> [<ffffffffa0330a2c>] ? smb_sendv+0x7a/0x2cf [cifs] >> > >> [<ffffffff81039b72>] ? select_idle_sibling+0xec/0x127 >> > >> [<ffffffff8135792c>] __sock_recvmsg+0x49/0x54 >> > >> [<ffffffff81357e96>] sock_recvmsg+0xa6/0xbf >> > >> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 >> > >> [<ffffffff81041ee4>] ? try_to_wake_up+0x1ad/0x1c8 >> > >> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 >> > >> [<ffffffff81041f0c>] ? default_wake_function+0xd/0xf >> > >> [<ffffffff8105c7e0>] ? autoremove_wake_function+0x11/0x34 >> > >> [<ffffffffa0330ca2>] ? smb_send+0x21/0x23 [cifs] >> > >> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 >> > >> [<ffffffff814164ec>] ? sub_preempt_count+0x92/0xa5 >> > >> [<ffffffffa03311d1>] ? SendReceive+0x13f/0x317 [cifs] >> > >> [<ffffffff814133f8>] ? _raw_spin_unlock_irqrestore+0x3a/0x47 >> > >> [<ffffffff810382c6>] ? __wake_up+0x3f/0x48 >> > >> [<ffffffffa031d839>] ? CIFSSMBNegotiate+0x191/0x766 [cifs] >> > >> [<ffffffff81357ee4>] kernel_recvmsg+0x35/0x41 >> > >> [<ffffffff81412526>] ? __mutex_lock_common+0x358/0x3bc >> > >> [<ffffffffa0321d20>] cifs_demultiplex_thread+0x21e/0xcd9 [cifs] >> > >> [<ffffffffa031fd7b>] ? cifs_negotiate_protocol+0x37/0x87 [cifs] >> > >> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 >> > >> [<ffffffff8141259e>] ? 
__mutex_lock_slowpath+0x14/0x16 >> > >> [<ffffffff8103838e>] ? need_resched+0x1e/0x28 >> > >> [<ffffffffa0315fed>] ? cifs_reconnect_tcon+0x19a/0x2c9 [cifs] >> > >> [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs] >> > >> [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34 >> > >> [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs] >> > >> [<ffffffff8105c3bf>] kthread+0x7d/0x85 >> > >> [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs] >> > >> [<ffffffff8100b8e4>] kernel_thread_helper+0x4/0x10 >> > >> [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs] >> > >> [<ffffffff8105c342>] ? kthread+0x0/0x85 >> > >> [<ffffffff8100b8e0>] ? kernel_thread_helper+0x0/0x10 >> > >> [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs] >> > >> Code: 48 8b 4d d8 [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28 >> > >> 48 8b [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50 >> > >> 55 e0 48 [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c >> > >> 8b 75 [<ffffffff810a4c56>] ? filemap_write_and_wait+0x18/0x33 >> > >> e8 ff 90 [<ffffffffa0326648>] ? cifs_flush+0x2d/0x60 [cifs] >> > >> a8 00 [<ffffffff810e901f>] ? filp_close+0x3e/0x6d >> > >> 00 00 [<ffffffff810e90f6>] ? sys_close+0xa8/0xe2 >> > >> 48 83 c4 [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b >> > >> 28 5b >> > >> ---[ end trace 3387e7bab0a9c645 ]--- >> > > >> > > Kaboom. So you're seeing oopses too. Could you get a listing of the >> > > place where it oopsed by following the instructions here? >> > > >> > > http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting#Oopses >> > > >> > > I suspect that "sock" is NULL in this case too and it blew up in >> > > kernel_recvmsg. >> > >> > I added code to WARN_ON when ssocket was null. 
This isn't a real panic,
>> > just a WARN_ON:
>> >
>> >
>> > static int
>> > smb_sendv(struct TCP_Server_Info *server, struct kvec *iov, int n_vec)
>> > {
>> > 	int rc = 0;
>> > 	int i = 0;
>> > 	struct msghdr smb_msg;
>> > 	struct smb_hdr *smb_buffer = iov[0].iov_base;
>> > 	unsigned int len = iov[0].iov_len;
>> > 	unsigned int total_len;
>> > 	int first_vec = 0;
>> > 	unsigned int smb_buf_length = smb_buffer->smb_buf_length;
>> > 	struct socket *ssocket = server->ssocket;
>> >
>> > 	if (ssocket == NULL) {
>> > 		cERROR(1, "need to reconnect in sendv here");
>> > *** HERE ***	WARN_ON_ONCE(1);
>> > 		return -ENOTSOCK; /* BB eventually add reconnect code here */
>> > 	}
>> >
>> > A second warn-on when ENOTSOCK is percolated up the calling stack
>> > a bit causes the other stack dumpage. I think the one above is the root
>> > cause...need to figure out how to have it gracefully bail out and re-connect
>> > when it hits this state, as the current code just calls this general loop over
>> > and over again.
>> >
>>
>> No, your warning is there, but it's Oopsing too:
>>
>> > >> ------------[ cut here ]------------
>> > >> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]()
>> > >> BUG: unable to handle kernel
>> > >> Hardware name: X8ST3
>> > >> NULL pointer dereference
>> > >> Modules linked in: at 0000000000000020
>> > >> be2iscsi
>> > >> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
>>
>>
>> ...smb_sendv is called by the "send" side, which is generally a
>> userspace process. The oops happened on the receive side. cifsd called
>> kernel_recvmsg, and it looks like it passed in a NULL sock pointer.
>
> I suspect that the following (untested) patch will fix this. I think
> the symptoms that you've seen are consistent with the patch
> description. Ben, would you be able to test this in your setup? This
> should at least prevent the oopses.
>
> ------------------[snip]--------------------
>
> [PATCH] cifs: don't allow cifs_reconnect to exit with NULL socket pointer
>
> It's possible for the following set of events to happen:
>
> cifsd calls cifs_reconnect, which reconnects the socket. A userspace
> process then calls cifs_negotiate_protocol to handle the NEGOTIATE and
> gets a reply. But, while processing the reply, cifsd calls
> cifs_reconnect again. Eventually the GlobalMid_Lock is dropped, the
> reply from the earlier NEGOTIATE completes, and the tcpStatus is set to
> CifsGood. cifs_reconnect then goes through and closes the socket and
> sets the pointer to zero, but because the status is now CifsGood, the
> new socket is not created and cifs_reconnect exits with the socket
> pointer set to NULL.
>
> Fix this by only setting the tcpStatus to CifsGood if the tcpStatus is
> CifsNeedNegotiate, and by making sure that generic_ip_connect is always
> called at least once in cifs_reconnect.
>
> Note that this is not a perfect fix for this issue. It's still possible
> that the NEGOTIATE reply is handled after the socket has been closed and
> reconnected. In that case, the socket state will look correct, but no
> NEGOTIATE was performed on it. In that situation, though, the server
> should just shut down the socket on the next attempted send, rather
> than causing the oops that occurs today.
>
> Reported-by: Ben Greear <greearb@candelatech.com>
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  fs/cifs/connect.c |    6 +++---
>  1 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
> index 84c7307..8bb55bc 100644
> --- a/fs/cifs/connect.c
> +++ b/fs/cifs/connect.c
> @@ -152,7 +152,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
>  			mid_entry->callback(mid_entry);
>  	}
>
> -	while (server->tcpStatus == CifsNeedReconnect) {
> +	do {
>  		try_to_freeze();
>
>  		/* we should try only the port we connected to before */
> @@ -167,7 +167,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
>  			server->tcpStatus = CifsNeedNegotiate;
>  			spin_unlock(&GlobalMid_Lock);
>  		}
> -	}
> +	} while (server->tcpStatus == CifsNeedReconnect);
>
>  	return rc;
>  }
> @@ -3371,7 +3371,7 @@ int cifs_negotiate_protocol(unsigned int xid, struct cifs_ses *ses)
>  	}
>  	if (rc == 0) {
>  		spin_lock(&GlobalMid_Lock);
> -		if (server->tcpStatus != CifsExiting)
> +		if (server->tcpStatus == CifsNeedNegotiate)
>  			server->tcpStatus = CifsGood;
>  		else
>  			rc = -EHOSTDOWN;
> --
> 1.7.5.2
>
>
> --
> Jeff Layton <jlayton@redhat.com>
On 06/06/2011 06:45 AM, Jeff Layton wrote:
> [full oops report and patch quoted above -- snipped]

This has some merge issues on 2.6.38.8:

<<<<<<<
	while ((server->tcpStatus != CifsExiting) &&
	       (server->tcpStatus != CifsGood)) {
=======
	do {
>>>>>>>

Should I keep your comparison for tcpStatus == CifsNeedReconnect instead of these != comparisons above?

Thanks,
Ben
On Mon, 06 Jun 2011 09:47:40 -0700 Ben Greear <greearb@candelatech.com> wrote: > On 06/06/2011 06:45 AM, Jeff Layton wrote: > > On Sat, 4 Jun 2011 07:19:23 -0400 > > Jeff Layton<jlayton@redhat.com> wrote: > > > >> On Fri, 03 Jun 2011 22:03:43 -0700 > >> Ben Greear<greearb@candelatech.com> wrote: > >> > >>> On 06/03/2011 06:42 PM, Jeff Layton wrote: > >>>> On Fri, 03 Jun 2011 14:01:11 -0700 > >>>> Ben Greear<greearb@candelatech.com> wrote: > >>>> > >>>>> Ok, we had some luck. Here's the backtrace and attending dmesg > >>>>> output. The filer has been doing failover, but it has not gone > >>>>> into a failed state...so, the system *should* be able to reconnect. > >>>>> > >>>>> We have the system in the failed state now and will leave it that way > >>>>> for a bit in case you have some commands you'd like me to run. > >>>>> > >>>>> Aside from the hung cifs processes (anything accessing those mounts > >>>>> gets into the D state), the system seems fine. > >>>>> > >>>>> > >>>>> CIFS VFS: Unexpected lookup error -112 > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Unexpected lookup error -11 > >>>>> CIFS VFS: Unexpected lookup error -112 > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: 
Unexpected lookup error -112 > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Unexpected lookup error -11 > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: Reconnecting tcp session > >>>>> CIFS VFS: need to reconnect in sendv here > >>>>> ------------[ cut here ]------------ > >>>>> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]() > >>>>> BUG: unable to handle kernel > >>>>> Hardware name: X8ST3 > >>>>> NULL pointer dereference > >>>>> Modules linked in: at 0000000000000020 > >>>>> be2iscsi > >>>>> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e > >>>>> bnx2iPGD 0 cnic > >>>>> uio > >>>>> Oops: 0000 [#1] cxgb3iPREEMPT libcxgbiSMP cxgb3 > >>>>> mdio > >>>>> last sysfs file: /sys/devices/platform/host10/session7/target10:0:0/10:0:0:0/block/sde/sde1/stat > >>>>> ib_iserCPU 2 rdma_cm > >>>>> Modules linked in: ib_cm be2iscsi iw_cm iscsi_boot_sysfs ib_sa bnx2i ib_mad cnic ib_core uio ib_addr cxgb3i md4 libcxgbi nls_utf8 cxgb3 cifs mdio xt_TPROXY > >>>>> ib_iser rdma_cm nf_tproxy_core ib_cm xt_socket iw_cm ib_sa ip6_tables ib_mad ib_core nf_defrag_ipv6 ib_addr md4 nls_utf8 xt_connlimit cifs xt_TPROXY > >>>>> nf_tproxy_core xt_socket ip6_tables nf_defrag_ipv6 xt_connlimit 8021q garp bridge stp llc fuse macvlan wanlink(P) pktgen iscsi_tcp libiscsi_tcp libiscsi > >>>>> scsi_transport_iscsi nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput i2c_i801 e1000e 8021q i2c_core garp igb bridge ioatdma stp iTCO_wdt llc > >>>>> i7core_edac fuse iTCO_vendor_support macvlan pcspkr wanlink(P) dca pktgen edac_core iscsi_tcp microcode [last unloaded: ipt_addrtype] libiscsi_tcp > >>>>> libiscsi > >>>>> scsi_transport_iscsi > >>>>> Pid: 5047, comm: cifsd Tainted: P 2.6.38.8+ #12 nfs lockdSupermicro X8ST3 fscache/X8ST3 nfs_acl > >>>>> auth_rpcgss > >>>>> RIP: 0010:[<ffffffff81356230>] 
[<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e > >>>>> sunrpc > >>>>> RSP: 0018:ffff8802e64e5bc0 EFLAGS: 00010286 > >>>>> ipv6 > >>>>> RAX: 0000000000000000 RBX: ffff8802e64e5c40 RCX: 0000000000000004 > >>>>> uinput > >>>>> RDX: ffff8802e64e5e40 RSI: 0000000000000000 RDI: ffff8802e64e5c40 > >>>>> i2c_i801 > >>>>> RBP: ffff8802e64e5bf0 R08: 0000000000000000 R09: 0000000000000000 > >>>>> R10: ffff8802e64e5d80 R11: ffff8802e64e5e40 R12: ffff8802e64e5d10 > >>>>> e1000e > >>>>> R13: 0000000000000000 R14: ffff8802e64e5c40 R15: ffff8802e6429f80 > >>>>> i2c_core > >>>>> FS: 0000000000000000(0000) GS:ffff8800df440000(0000) knlGS:0000000000000000 > >>>>> igb > >>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > >>>>> ioatdma > >>>>> CR2: 0000000000000020 CR3: 0000000001803000 CR4: 00000000000006e0 > >>>>> iTCO_wdt > >>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > >>>>> i7core_edac > >>>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > >>>>> iTCO_vendor_supportProcess cifsd (pid: 5047, threadinfo ffff8802e64e4000, task ffff88030482d880) > >>>>> pcspkr > >>>>> Stack: > >>>>> dca ffff8802e64e5c10 edac_core ffffffff81039b72 microcode ffff880200000001 [last unloaded: ipt_addrtype] ffff880305a40fc8 > >>>>> > >>>>> ffff8802e64e5e40Pid: 4754, comm: btserver Tainted: P 2.6.38.8+ #12 > >>>>> 0000000000000004Call Trace: > >>>>> ffff8802e64e5c30 ffffffff8135792c > >>>>> 0000000000000000 0000000000000000 [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98 > >>>>> ffff8802e64e5c40 ffffffffffffffff [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17 > >>>>> > >>>>> Call Trace: > >>>>> [<ffffffffa0330a2c>] ? smb_sendv+0x7a/0x2cf [cifs] > >>>>> [<ffffffff81039b72>] ? select_idle_sibling+0xec/0x127 > >>>>> [<ffffffff8135792c>] __sock_recvmsg+0x49/0x54 > >>>>> [<ffffffff81357e96>] sock_recvmsg+0xa6/0xbf > >>>>> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42 > >>>>> [<ffffffff81041ee4>] ? 
> >>>>> try_to_wake_up+0x1ad/0x1c8
> >>>>> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
> >>>>> [<ffffffff81041f0c>] ? default_wake_function+0xd/0xf
> >>>>> [<ffffffff8105c7e0>] ? autoremove_wake_function+0x11/0x34
> >>>>> [<ffffffffa0330ca2>] ? smb_send+0x21/0x23 [cifs]
> >>>>> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
> >>>>> [<ffffffff814164ec>] ? sub_preempt_count+0x92/0xa5
> >>>>> [<ffffffffa03311d1>] ? SendReceive+0x13f/0x317 [cifs]
> >>>>> [<ffffffff814133f8>] ? _raw_spin_unlock_irqrestore+0x3a/0x47
> >>>>> [<ffffffff810382c6>] ? __wake_up+0x3f/0x48
> >>>>> [<ffffffffa031d839>] ? CIFSSMBNegotiate+0x191/0x766 [cifs]
> >>>>> [<ffffffff81357ee4>] kernel_recvmsg+0x35/0x41
> >>>>> [<ffffffff81412526>] ? __mutex_lock_common+0x358/0x3bc
> >>>>> [<ffffffffa0321d20>] cifs_demultiplex_thread+0x21e/0xcd9 [cifs]
> >>>>> [<ffffffffa031fd7b>] ? cifs_negotiate_protocol+0x37/0x87 [cifs]
> >>>>> [<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
> >>>>> [<ffffffff8141259e>] ? __mutex_lock_slowpath+0x14/0x16
> >>>>> [<ffffffff8103838e>] ? need_resched+0x1e/0x28
> >>>>> [<ffffffffa0315fed>] ? cifs_reconnect_tcon+0x19a/0x2c9 [cifs]
> >>>>> [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs]
> >>>>> [<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34
> >>>>> [<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs]
> >>>>> [<ffffffff8105c3bf>] kthread+0x7d/0x85
> >>>>> [<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs]
> >>>>> [<ffffffff8100b8e4>] kernel_thread_helper+0x4/0x10
> >>>>> [<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs]
> >>>>> [<ffffffff8105c342>] ? kthread+0x0/0x85
> >>>>> [<ffffffff8100b8e0>] ? kernel_thread_helper+0x0/0x10
> >>>>> [<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs]
> >>>>> Code: 48 8b 4d d8 [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28
> >>>>> 48 8b [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50
> >>>>> 55 e0 48 [<ffffffff810a4c3c>] ?
> >>>>> filemap_fdatawrite+0x1a/0x1c
> >>>>> 8b 75 [<ffffffff810a4c56>] ? filemap_write_and_wait+0x18/0x33
> >>>>> e8 ff 90 [<ffffffffa0326648>] ? cifs_flush+0x2d/0x60 [cifs]
> >>>>> a8 00 [<ffffffff810e901f>] ? filp_close+0x3e/0x6d
> >>>>> 00 00 [<ffffffff810e90f6>] ? sys_close+0xa8/0xe2
> >>>>> 48 83 c4 [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b
> >>>>> 28 5b
> >>>>> ---[ end trace 3387e7bab0a9c645 ]---
> >>>>
> >>>> Kaboom. So you're seeing oopses too. Could you get a listing of the
> >>>> place where it oopsed by following the instructions here?
> >>>>
> >>>> http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting#Oopses
> >>>>
> >>>> I suspect that "sock" is NULL in this case too and it blew up in
> >>>> kernel_recvmsg.
> >>>
> >>> I added code to WARN_ON when ssocket was null. This isn't a real panic,
> >>> just a WARN_ON:
> >>>
> >>> static int
> >>> smb_sendv(struct TCP_Server_Info *server, struct kvec *iov, int n_vec)
> >>> {
> >>> 	int rc = 0;
> >>> 	int i = 0;
> >>> 	struct msghdr smb_msg;
> >>> 	struct smb_hdr *smb_buffer = iov[0].iov_base;
> >>> 	unsigned int len = iov[0].iov_len;
> >>> 	unsigned int total_len;
> >>> 	int first_vec = 0;
> >>> 	unsigned int smb_buf_length = smb_buffer->smb_buf_length;
> >>> 	struct socket *ssocket = server->ssocket;
> >>>
> >>> 	if (ssocket == NULL) {
> >>> 		cERROR(1, "need to reconnect in sendv here");
> >>> *** HERE ***	WARN_ON_ONCE(1);
> >>> 		return -ENOTSOCK; /* BB eventually add reconnect code here */
> >>> 	}
> >>>
> >>> A second warn-on, when ENOTSOCK is percolated up the calling stack
> >>> a bit, causes the other stack dumpage. I think the one above is the root
> >>> cause... need to figure out how to have it gracefully bail out and re-connect
> >>> when it hits this state, as the current code just calls this general loop over
> >>> and over again.
> >>>
> >>
> >> No, your warning is there, but it's Oopsing too:
> >>
> >>>>> ------------[ cut here ]------------
> >>>>> WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]()
> >>>>> BUG: unable to handle kernel
> >>>>> Hardware name: X8ST3
> >>>>> NULL pointer dereference
> >>>>> Modules linked in: at 0000000000000020
> >>>>> be2iscsi
> >>>>> IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
> >>
> >>
> >> ...smb_sendv is called by the "send" side which is generally a
> >> userspace process. The oops happened on the receive side. cifsd called
> >> kernel_recvmsg, and it looks like it passed in a NULL sock pointer.
> >>
> >
> > I suspect that the following (untested) patch will fix this. I think
> > the symptoms that you've seen are consistent with the patch
> > description. Ben, would you be able to test this in your setup? This
> > should at least prevent the oopses.
> >
> > ------------------[snip]--------------------
> >
> > [PATCH] cifs: don't allow cifs_reconnect to exit with NULL socket pointer
> >
> > It's possible for the following set of events to happen:
> >
> > cifsd calls cifs_reconnect which reconnects the socket. A userspace
> > process then calls cifs_negotiate_protocol to handle the NEGOTIATE and
> > gets a reply. But, while processing the reply, cifsd calls
> > cifs_reconnect again. Eventually the GlobalMid_Lock is dropped and the
> > reply from the earlier NEGOTIATE completes and the tcpStatus is set to
> > CifsGood. cifs_reconnect then goes through and closes the socket and sets the
> > pointer to zero, but because the status is now CifsGood, the new socket
> > is not created and cifs_reconnect exits with the socket pointer set to
> > NULL.
> >
> > Fix this by only setting the tcpStatus to CifsGood if the tcpStatus is
> > CifsNeedNegotiate, and by making sure that generic_ip_connect is always
> > called at least once in cifs_reconnect.
> >
> > Note that this is not a perfect fix for this issue. It's still possible
> > that the NEGOTIATE reply is handled after the socket has been closed and
> > reconnected. In that case, the socket state will look correct but no
> > NEGOTIATE was performed on it. In that situation though the server
> > should just shut down the socket on the next attempted send, rather
> > than causing the oops that occurs today.
> >
> > Reported-by: Ben Greear <greearb@candelatech.com>
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > ---
> > fs/cifs/connect.c | 6 +++---
> > 1 files changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
> > index 84c7307..8bb55bc 100644
> > --- a/fs/cifs/connect.c
> > +++ b/fs/cifs/connect.c
> > @@ -152,7 +152,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
> > 			mid_entry->callback(mid_entry);
> > 	}
> >
> > -	while (server->tcpStatus == CifsNeedReconnect) {
> > +	do {
> > 		try_to_freeze();
> >
> > 		/* we should try only the port we connected to before */
> > @@ -167,7 +167,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
> > 			server->tcpStatus = CifsNeedNegotiate;
> > 			spin_unlock(&GlobalMid_Lock);
> > 		}
> > -	}
> > +	} while (server->tcpStatus == CifsNeedReconnect);
> >
> > 	return rc;
> > }
> > @@ -3371,7 +3371,7 @@ int cifs_negotiate_protocol(unsigned int xid, struct cifs_ses *ses)
> > 	}
> > 	if (rc == 0) {
> > 		spin_lock(&GlobalMid_Lock);
> > -		if (server->tcpStatus != CifsExiting)
> > +		if (server->tcpStatus == CifsNeedNegotiate)
> > 			server->tcpStatus = CifsGood;
> > 		else
> > 			rc = -EHOSTDOWN;
>
> This has some merge issues on 2.6.38.8:
>
>
> <<<<<<<
> while ((server->tcpStatus != CifsExiting) &&
> (server->tcpStatus != CifsGood)) {
> =======
> do {
> >>>>>>>
>
> Should I keep your comparison for tcpStatus == CifsNeedReconnect
> instead of these != comparisons above?
>
>
> Thanks,
> Ben
>
No, I think you probably just want to take patch fd88ce9313 too, which
should fix up the merge conflict.
On 06/06/2011 09:51 AM, Jeff Layton wrote:
> On Mon, 06 Jun 2011 09:47:40 -0700
> Ben Greear <greearb@candelatech.com> wrote:
>
>> On 06/06/2011 06:45 AM, Jeff Layton wrote:
>>> On Sat, 4 Jun 2011 07:19:23 -0400
>>> [PATCH] cifs: don't allow cifs_reconnect to exit with NULL socket pointer
>>>
>>> It's possible for the following set of events to happen:
>>>
>>> cifsd calls cifs_reconnect which reconnects the socket. A userspace
>>> process then calls cifs_negotiate_protocol to handle the NEGOTIATE and
>>> gets a reply. But, while processing the reply, cifsd calls
>>> cifs_reconnect again. Eventually the GlobalMid_Lock is dropped and the
>>> reply from the earlier NEGOTIATE completes and the tcpStatus is set to
>>> CifsGood. cifs_reconnect then goes through and closes the socket and sets the
>>> pointer to zero, but because the status is now CifsGood, the new socket
>>> is not created and cifs_reconnect exits with the socket pointer set to
>>> NULL.
>>>
>>> Fix this by only setting the tcpStatus to CifsGood if the tcpStatus is
>>> CifsNeedNegotiate, and by making sure that generic_ip_connect is always
>>> called at least once in cifs_reconnect.
>>>
>>> Note that this is not a perfect fix for this issue. It's still possible
>>> that the NEGOTIATE reply is handled after the socket has been closed and
>>> reconnected. In that case, the socket state will look correct but no
>>> NEGOTIATE was performed on it. In that situation though the server
>>> should just shut down the socket on the next attempted send, rather
>>> than causing the oops that occurs today.
>>>
>>> Reported-by: Ben Greear <greearb@candelatech.com>
>>> Signed-off-by: Jeff Layton <jlayton@redhat.com>
>>> ---
>>> fs/cifs/connect.c | 6 +++---
>>> 1 files changed, 3 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
>>> index 84c7307..8bb55bc 100644
>>> --- a/fs/cifs/connect.c
>>> +++ b/fs/cifs/connect.c
>>> @@ -152,7 +152,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
>>> 			mid_entry->callback(mid_entry);
>>> 	}
>>>
>>> -	while (server->tcpStatus == CifsNeedReconnect) {
>>> +	do {
>>> 		try_to_freeze();
>>>
>>> 		/* we should try only the port we connected to before */
>>> @@ -167,7 +167,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
>>> 			server->tcpStatus = CifsNeedNegotiate;
>>> 			spin_unlock(&GlobalMid_Lock);
>>> 		}
>>> -	}
>>> +	} while (server->tcpStatus == CifsNeedReconnect);
>>>
>>> 	return rc;
>>> }
>>> @@ -3371,7 +3371,7 @@ int cifs_negotiate_protocol(unsigned int xid, struct cifs_ses *ses)
>>> 	}
>>> 	if (rc == 0) {
>>> 		spin_lock(&GlobalMid_Lock);
>>> -		if (server->tcpStatus != CifsExiting)
>>> +		if (server->tcpStatus == CifsNeedNegotiate)
>>> 			server->tcpStatus = CifsGood;
>>> 		else
>>> 			rc = -EHOSTDOWN;
>>
>>
>> This has some merge issues on 2.6.38.8:
>>
>>
>> <<<<<<<
>> while ((server->tcpStatus != CifsExiting) &&
>> (server->tcpStatus != CifsGood)) {
>> =======
>> do {
>> >>>>>>>
>>
>> Should I keep your comparison for tcpStatus == CifsNeedReconnect
>> instead of these != comparisons above?
>>
>>
>> Thanks,
>> Ben
>>
>
> No, I think you probably just want to take patch fd88ce9313 too, which
> should fix up the merge conflict.

Ok, I've applied those two..we'll start testing with these patches
today. Might take a while before we are certain the fix works, as
this isn't usually easy or fast to reproduce.

Thanks,
Ben
Ben,
Thanks - this may be a very rare case - hard to prove without your testing
but it looks like Jeff's patch makes sense.

On Mon, Jun 6, 2011 at 12:22 PM, Ben Greear <greearb@candelatech.com> wrote:
> On 06/06/2011 09:51 AM, Jeff Layton wrote:
>>
>> On Mon, 06 Jun 2011 09:47:40 -0700
>> Ben Greear <greearb@candelatech.com> wrote:
>>
>>> On 06/06/2011 06:45 AM, Jeff Layton wrote:
>>>>
>>>> On Sat, 4 Jun 2011 07:19:23 -0400
>
>>>> [PATCH] cifs: don't allow cifs_reconnect to exit with NULL socket
>>>> pointer
>>>>
>>>> It's possible for the following set of events to happen:
>>>>
>>>> cifsd calls cifs_reconnect which reconnects the socket. A userspace
>>>> process then calls cifs_negotiate_protocol to handle the NEGOTIATE and
>>>> gets a reply. But, while processing the reply, cifsd calls
>>>> cifs_reconnect again. Eventually the GlobalMid_Lock is dropped and the
>>>> reply from the earlier NEGOTIATE completes and the tcpStatus is set to
>>>> CifsGood. cifs_reconnect then goes through and closes the socket and
>>>> sets the
>>>> pointer to zero, but because the status is now CifsGood, the new socket
>>>> is not created and cifs_reconnect exits with the socket pointer set to
>>>> NULL.
>>>>
>>>> Fix this by only setting the tcpStatus to CifsGood if the tcpStatus is
>>>> CifsNeedNegotiate, and by making sure that generic_ip_connect is always
>>>> called at least once in cifs_reconnect.
>>>>
>>>> Note that this is not a perfect fix for this issue. It's still possible
>>>> that the NEGOTIATE reply is handled after the socket has been closed and
>>>> reconnected. In that case, the socket state will look correct but no
>>>> NEGOTIATE was performed on it. In that situation though the server
>>>> should just shut down the socket on the next attempted send, rather
>>>> than causing the oops that occurs today.
>>>>
>>>> Reported-by: Ben Greear <greearb@candelatech.com>
>>>> Signed-off-by: Jeff Layton <jlayton@redhat.com>
>>>> ---
>>>> fs/cifs/connect.c | 6 +++---
>>>> 1 files changed, 3 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
>>>> index 84c7307..8bb55bc 100644
>>>> --- a/fs/cifs/connect.c
>>>> +++ b/fs/cifs/connect.c
>>>> @@ -152,7 +152,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
>>>> 			mid_entry->callback(mid_entry);
>>>> 	}
>>>>
>>>> -	while (server->tcpStatus == CifsNeedReconnect) {
>>>> +	do {
>>>> 		try_to_freeze();
>>>>
>>>> 		/* we should try only the port we connected to before */
>>>> @@ -167,7 +167,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
>>>> 			server->tcpStatus = CifsNeedNegotiate;
>>>> 			spin_unlock(&GlobalMid_Lock);
>>>> 		}
>>>> -	}
>>>> +	} while (server->tcpStatus == CifsNeedReconnect);
>>>>
>>>> 	return rc;
>>>> }
>>>> @@ -3371,7 +3371,7 @@ int cifs_negotiate_protocol(unsigned int xid,
>>>> struct cifs_ses *ses)
>>>> 	}
>>>> 	if (rc == 0) {
>>>> 		spin_lock(&GlobalMid_Lock);
>>>> -		if (server->tcpStatus != CifsExiting)
>>>> +		if (server->tcpStatus == CifsNeedNegotiate)
>>>> 			server->tcpStatus = CifsGood;
>>>> 		else
>>>> 			rc = -EHOSTDOWN;
>>>
>>>
>>> This has some merge issues on 2.6.38.8:
>>>
>>>
>>> <<<<<<<
>>> while ((server->tcpStatus != CifsExiting) &&
>>> (server->tcpStatus != CifsGood)) {
>>> =======
>>> do {
>>> >>>>>>>
>>>
>>> Should I keep your comparison for tcpStatus == CifsNeedReconnect
>>> instead of these != comparisons above?
>>>
>>>
>>> Thanks,
>>> Ben
>>>
>>
>> No, I think you probably just want to take patch fd88ce9313 too, which
>> should fix up the merge conflict.
>
> Ok, I've applied those two..we'll start testing with these patches
> today. Might take a while before we are certain the fix works, as
> this isn't usually easy or fast to reproduce.
>
> Thanks,
> Ben
>
> --
> Ben Greear <greearb@candelatech.com>
> Candela Technologies Inc  http://www.candelatech.com
>
>
On 06/06/2011 06:00 PM, Steve French wrote:
> Ben,
> Thanks - this may be a very rare case - hard to prove without your testing
> but it looks like Jeff's patch makes sense.

We've had 3+ days of clean failover testing, so I think that patch
did solve the problem. You are welcome to add my Tested-by if you want.

Thanks,
Ben

>
> On Mon, Jun 6, 2011 at 12:22 PM, Ben Greear <greearb@candelatech.com> wrote:
>> On 06/06/2011 09:51 AM, Jeff Layton wrote:
>>>
>>> On Mon, 06 Jun 2011 09:47:40 -0700
>>> Ben Greear <greearb@candelatech.com> wrote:
>>>
>>>> On 06/06/2011 06:45 AM, Jeff Layton wrote:
>>>>>
>>>>> On Sat, 4 Jun 2011 07:19:23 -0400
>>
>>>>> [PATCH] cifs: don't allow cifs_reconnect to exit with NULL socket
>>>>> pointer
>>>>>
>>>>> It's possible for the following set of events to happen:
>>>>>
>>>>> cifsd calls cifs_reconnect which reconnects the socket. A userspace
>>>>> process then calls cifs_negotiate_protocol to handle the NEGOTIATE and
>>>>> gets a reply. But, while processing the reply, cifsd calls
>>>>> cifs_reconnect again. Eventually the GlobalMid_Lock is dropped and the
>>>>> reply from the earlier NEGOTIATE completes and the tcpStatus is set to
>>>>> CifsGood. cifs_reconnect then goes through and closes the socket and
>>>>> sets the
>>>>> pointer to zero, but because the status is now CifsGood, the new socket
>>>>> is not created and cifs_reconnect exits with the socket pointer set to
>>>>> NULL.
>>>>>
>>>>> Fix this by only setting the tcpStatus to CifsGood if the tcpStatus is
>>>>> CifsNeedNegotiate, and by making sure that generic_ip_connect is always
>>>>> called at least once in cifs_reconnect.
>>>>>
>>>>> Note that this is not a perfect fix for this issue. It's still possible
>>>>> that the NEGOTIATE reply is handled after the socket has been closed and
>>>>> reconnected. In that case, the socket state will look correct but no
>>>>> NEGOTIATE was performed on it.
>>>>> In that situation though the server
>>>>> should just shut down the socket on the next attempted send, rather
>>>>> than causing the oops that occurs today.
>>>>>
>>>>> Reported-by: Ben Greear <greearb@candelatech.com>
>>>>> Signed-off-by: Jeff Layton <jlayton@redhat.com>
>>>>> ---
>>>>> fs/cifs/connect.c | 6 +++---
>>>>> 1 files changed, 3 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
>>>>> index 84c7307..8bb55bc 100644
>>>>> --- a/fs/cifs/connect.c
>>>>> +++ b/fs/cifs/connect.c
>>>>> @@ -152,7 +152,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
>>>>> 			mid_entry->callback(mid_entry);
>>>>> 	}
>>>>>
>>>>> -	while (server->tcpStatus == CifsNeedReconnect) {
>>>>> +	do {
>>>>> 		try_to_freeze();
>>>>>
>>>>> 		/* we should try only the port we connected to before */
>>>>> @@ -167,7 +167,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
>>>>> 			server->tcpStatus = CifsNeedNegotiate;
>>>>> 			spin_unlock(&GlobalMid_Lock);
>>>>> 		}
>>>>> -	}
>>>>> +	} while (server->tcpStatus == CifsNeedReconnect);
>>>>>
>>>>> 	return rc;
>>>>> }
>>>>> @@ -3371,7 +3371,7 @@ int cifs_negotiate_protocol(unsigned int xid,
>>>>> struct cifs_ses *ses)
>>>>> 	}
>>>>> 	if (rc == 0) {
>>>>> 		spin_lock(&GlobalMid_Lock);
>>>>> -		if (server->tcpStatus != CifsExiting)
>>>>> +		if (server->tcpStatus == CifsNeedNegotiate)
>>>>> 			server->tcpStatus = CifsGood;
>>>>> 		else
>>>>> 			rc = -EHOSTDOWN;
>>>>
>>>>
>>>> This has some merge issues on 2.6.38.8:
>>>>
>>>>
>>>> <<<<<<<
>>>> while ((server->tcpStatus != CifsExiting) &&
>>>> (server->tcpStatus != CifsGood)) {
>>>> =======
>>>> do {
>>>> >>>>>>>
>>>>
>>>> Should I keep your comparison for tcpStatus == CifsNeedReconnect
>>>> instead of these != comparisons above?
>>>>
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>
>>> No, I think you probably just want to take patch fd88ce9313 too, which
>>> should fix up the merge conflict.
>>
>> Ok, I've applied those two..we'll start testing with these patches
>> today.
>> Might take a while before we are certain the fix works, as
>> this isn't usually easy or fast to reproduce.
>>
>> Thanks,
>> Ben
>>
>> --
>> Ben Greear <greearb@candelatech.com>
>> Candela Technologies Inc  http://www.candelatech.com
>>
>>
>
>
>
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 84c7307..8bb55bc 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -152,7 +152,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
 			mid_entry->callback(mid_entry);
 	}
 
-	while (server->tcpStatus == CifsNeedReconnect) {
+	do {
 		try_to_freeze();
 
 		/* we should try only the port we connected to before */
@@ -167,7 +167,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
 			server->tcpStatus = CifsNeedNegotiate;
 			spin_unlock(&GlobalMid_Lock);
 		}
-	}
+	} while (server->tcpStatus == CifsNeedReconnect);
 
 	return rc;
 }
@@ -3371,7 +3371,7 @@ int cifs_negotiate_protocol(unsigned int xid, struct cifs_ses *ses)
 	}
 	if (rc == 0) {
 		spin_lock(&GlobalMid_Lock);
-		if (server->tcpStatus != CifsExiting)
+		if (server->tcpStatus == CifsNeedNegotiate)
 			server->tcpStatus = CifsGood;
 		else
 			rc = -EHOSTDOWN;