From patchwork Sun Nov 6 16:12:23 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Tariq Toukan
X-Patchwork-Id: 9413947
Subject: Re: Crash in mlx4 shutdown with 4.9-rc3
To: Leon Romanovsky, Steve Wise
References: <01e001d236a7$e4c24160$ae46c420$@opengridcomputing.com>
 <20161105131513.GP3617@leon.nu>
Cc: yishaih@mellanox.com, linux-rdma@vger.kernel.org, Majd Dibbiny,
 Tariq Toukan
From: Tariq Toukan
Date: Sun, 6 Nov 2016 18:12:23 +0200
In-Reply-To: <20161105131513.GP3617@leon.nu>
X-Mailing-List: linux-rdma@vger.kernel.org

Hi Steve,

On 05/11/2016 3:15
PM, Leon Romanovsky wrote:
> On Fri, Nov 04, 2016 at 09:29:47AM -0500, Steve Wise wrote:
>> Hey Yishai, Is this by chance a known bug having a pending fix somewhere? I'm
>> seeing it frequently when shutting down. I'm using 4.9-rc3 with memory
>> debugging enabled...
> Hi Steve,
>
> We have a fix for this oops in our submission queue to netdev and
> it is now in final stages of verification. Tariq is planning to submit
> it on Sunday.
>
> Thanks

This crash happens because the lifetime of mlx4_en_priv->mdev is shorter
than that of struct net_device.
One workaround is to add a netif_device_present() check in
dev_get_phys_port_id(); see the diff at the end of this message.

However, this causes other issues when combined with an MTU change.
During an MTU change, netif_device_present() returns false for a while,
causing an unexpected failure of dev_get_phys_port_id().

We're working on fixing this correctly, but that won't happen today.

Regards,
Tariq Toukan

>> [59984.502834] mlx4_core 0000:81:00.0: mlx4_shutdown was called
>> [59984.603599] mlx4_en 0000:81:00.0: removed PHC
>> [59985.145590] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
>> [59985.151990] Modules linked in: uio_pci_generic uio iw_cxgb4 cxgb4 nvmet_rdma
>> nvmet null_blk brd rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi
>> scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib
>> rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm dm_mirror dm_region_hash
>> dm_log dm_mod intel_rapl iosf_mbi sb_edac edac_core x86_pkg_temp_thermal
>> coretemp ext4 kvm jbd2 irqbypass crct10dif_pclmul crc32_pclmul
>> ghash_clmulni_intel mbcache aesni_intel lrw gf128mul iTCO_wdt glue_helper mei_me
>> iTCO_vendor_support ablk_helper cryptd mxm_wmi ipmi_si i2c_i801 lpc_ich mei sg
>> nfsd mfd_core i2c_smbus ipmi_msghandler pcspkr shpchp auth_rpcgss wmi nfs_acl
>> lockd grace sunrpc ip_tables xfs libcrc32c libcxgb mlx4_ib ib_core mlx4_en
>> sd_mod drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm
>> mlx4_core igb drm ahci libahci ptp libata crc32c_intel pps_core dca nvme
>> i2c_algo_bit nvme_core i2c_core [last unloaded: cxgb4]
>> [59985.239258] CPU: 30 PID: 10937 Comm: kworker/30:1 Not tainted
>> 4.9.0-rc3-debug+ #2
>> [59985.246992] Hardware name: Supermicro X9DR3-F/X9DR3-F, BIOS 3.2a 07/09/2015
>> [59985.254098] Workqueue: events linkwatch_event
>> [59985.258600] task: ffff88105312c6c0 task.stack: ffffc90020204000
>> [59985.264657] RIP: 0010:[] []
>> mlx4_en_get_phys_port_id+0x1a/0x50 [mlx4_en]
>> [59985.274874] RSP: 0018:ffffc90020207c30 EFLAGS: 00010286
>> [59985.280312] RAX: 6b6b6b6b6b6b6b6b RBX: ffff881048c220c0 RCX: 0000000000000000
>> [59985.287582] RDX: 0000000000000001 RSI: ffffc90020207cb0 RDI: ffff881037020000
>> [59985.294844] RBP: ffffc90020207c30 R08: 00000000000005f0 R09: ffff88102017e752
>> [59985.302100] R10: ffff88085f4090c0 R11: ffff88102017e678 R12: ffff881037020000
>> [59985.309356] R13: ffff88102017e678 R14: 0000000000000000 R15: 0000000000000000
>> [59985.316608] FS: 0000000000000000(0000) GS:ffff881057580000(0000)
>> knlGS:0000000000000000
>> [59985.324936] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [59985.330805] CR2: 00007fff8fd82ff8 CR3: 0000000001c07000 CR4: 00000000000406e0
>> [59985.338072] Stack:
>> [59985.340219] ffffc90020207c40 ffffffff81587a6e ffffc90020207d00
>> ffffffff815a36ce
>> [59985.347950] ffff881048c220c0 ffffc90020207cd7 0000000000000000
>> 0000000000000010
>> [59985.355684] 02000000ffffffff 000003e820000000 00000000000005dc
>> 0000010000000000
>> [59985.363408] Call Trace:
>> [59985.365994] [] dev_get_phys_port_id+0x1e/0x30
>> [59985.372123] [] rtnl_fill_ifinfo+0x4be/0xff0
>> [59985.378076] [] rtmsg_ifinfo_build_skb+0x73/0xe0
>> [59985.384377] [] rtmsg_ifinfo.part.27+0x16/0x50
>> [59985.390505] [] rtmsg_ifinfo+0x18/0x20
>> [59985.395940] [] netdev_state_change+0x46/0x50
>> [59985.401983] [] linkwatch_do_dev+0x38/0x50
>> [59985.407764] [] __linkwatch_run_queue+0xf5/0x170
>> [59985.414067] [] linkwatch_event+0x25/0x30
>> [59985.419764] [] process_one_work+0x152/0x400
>> [59985.425716] [] worker_thread+0x125/0x4b0
>> [59985.431409] [] ? rescuer_thread+0x350/0x350
>> [59985.437366] [] kthread+0xca/0xe0
>> [59985.442367] [] ? kthread_park+0x60/0x60
>> [59985.447978] [] ret_from_fork+0x25/0x30
>> [59985.453497] Code: f0 5d c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66
>> 66 90 55 48 8b 87 c0 08 00 00 48 63 97 9c d5 00 00 48 89 e5 48 8b 00 <48> 8b 94
>> d0 58 02 00 00 48 85 d2 74 1c c6 46 20 08 31 c0 88 54
>> [59985.474081] RIP [] mlx4_en_get_phys_port_id+0x1a/0x50
>> [mlx4_en]
>> [59985.481915] RSP
>> [59985.485910] ---[ end trace 317937c8890959b8 ]---
>> [59990.228721] Kernel panic - not syncing: Fatal exception
>> [59990.234181] Kernel Offset: disabled
>> [59990.239944] ---[ end Kernel panic - not syncing: Fatal exception
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html

--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6601,6 +6601,8 @@ int dev_get_phys_port_id(struct net_device *dev,
 
 	if (!ops->ndo_get_phys_port_id)
 		return -EOPNOTSUPP;
+	if (!netif_device_present(dev))
+		return -ENODEV;
 	return ops->ndo_get_phys_port_id(dev, ppid);
 }
 EXPORT_SYMBOL(dev_get_phys_port_id);
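
To see the effect of the guard in isolation, here is a minimal user-space sketch in C. It is not kernel code and not the planned fix. Everything named model_* is a hypothetical stand-in invented for illustration, and the "present" flag only imitates what netif_device_present() would report. The sketch assumes the driver clears that flag before it releases its private state.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; these are NOT the kernel structures. */
struct model_netdev;

struct model_netdev_ops {
	int (*get_phys_port_id)(struct model_netdev *dev, unsigned char *id);
};

/* Driver-private state whose lifetime is shorter than the netdev's. */
struct model_priv {
	unsigned char port_id;
};

struct model_netdev {
	const struct model_netdev_ops *ops;
	bool present;            /* imitates netif_device_present() */
	struct model_priv *priv; /* may be released before the netdev is */
};

/* Driver callback: dereferences priv, so it must not run after release. */
static int model_driver_get_id(struct model_netdev *dev, unsigned char *id)
{
	*id = dev->priv->port_id;
	return 0;
}

static const struct model_netdev_ops model_ops = {
	.get_phys_port_id = model_driver_get_id,
};

/* Core-side wrapper with the same shape as the guarded function above. */
int model_get_phys_port_id(struct model_netdev *dev, unsigned char *id)
{
	if (!dev->ops->get_phys_port_id)
		return -EOPNOTSUPP;
	if (!dev->present)       /* the added guard */
		return -ENODEV;
	return dev->ops->get_phys_port_id(dev, id);
}
```

With "present" cleared, the wrapper returns -ENODEV without calling into the driver, so a released priv is never touched. The same sketch also shows the drawback described above: any window in which "present" is temporarily false, as during an MTU change, makes the query fail even though priv is still valid.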