From patchwork Mon Oct 31 10:44:52 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lukas Wunner
X-Patchwork-Id: 9405255
X-Patchwork-Delegate: bhelgaas@google.com
Return-Path:
Received: from mail.wl.linuxfoundation.org
	(pdx-wl-mail.web.codeaurora.org [172.30.200.125])
	by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 9520F60234
	for ; Mon, 31 Oct 2016 10:45:01 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 8D3AC28F14
	for ; Mon, 31 Oct 2016 10:45:01 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id 7F15228F15; Mon, 31 Oct 2016 10:45:01 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level:
X-Spam-Status: No, score=-6.9 required=2.0 tests=BAYES_00,RCVD_IN_DNSWL_HI
	autolearn=ham version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4C30A28E75
	for ; Mon, 31 Oct 2016 10:45:00 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S933798AbcJaKo7 (ORCPT ); Mon, 31 Oct 2016 06:44:59 -0400
Received: from mailout3.hostsharing.net ([176.9.242.54]:35865 "EHLO
	mailout3.hostsharing.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S933787AbcJaKo6 (ORCPT ); Mon, 31 Oct 2016 06:44:58 -0400
Received: from h08.hostsharing.net (h08.hostsharing.net [83.223.95.28])
	(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by mailout3.hostsharing.net (Postfix) with ESMTPS id 6DD36101E67ED;
	Mon, 31 Oct 2016 11:44:55 +0100 (CET)
Received: from localhost (3-38-90-81.adsl.cmo.de [81.90.38.3])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-SHA (128/128 bits))
	(No client certificate requested)
	by h08.hostsharing.net (Postfix) with ESMTPSA id 6937C6018958;
	Mon, 31 Oct 2016 11:44:53 +0100 (CET)
X-Mailbox-Line: From c32129ab921613c41635a572356dc312df304199 Mon Sep 17 00:00:00 2001
Message-Id:
From: Lukas Wunner
Date: Mon, 31 Oct 2016 11:44:52 +0100
Subject: [PATCH] PCI: Equate surprise removal with channel error
To: linux-pci@vger.kernel.org, Bjorn Helgaas, Keith Busch
Cc: Andreas Noever, Siva Reddy Kallam, Prashant Sreedharan, Michael Chan
Sender: linux-pci-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-pci@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

Surprise removal of an Apple Thunderbolt Gigabit Ethernet adapter
results in a soft lockup because the tg3 driver busy-waits while trying
to shut down all function blocks of the hot-removed device in
tg3_abort_hw().

The soft lockup can be avoided if the network interface is shut down
before unplugging the device, but that's easily forgotten and shouldn't
be punished with an unusable machine.

tg3_abort_hw() uses pci_channel_offline() to check the AER status of the
device and refrains from accessing it if a channel error has occurred.
Many other drivers follow the same pattern.

We've just added an is_removed flag to struct pci_dev which enables
detection of surprise removal.  Checking this flag in
pci_channel_offline() is sufficient to fix the soft lockup.  Even hot
removing the device while packets are in flight works fine.

Scrolling on the console still becomes sluggish for about 1 sec on
unplug and there's an occasional error "hrtimer: interrupt took
192490781 ns".  This seems to be caused by accesses to the device in the
time window between unplug and removal of the driver, either by
tg3_msi() or tg3_poll().  It may be possible to mitigate this effect by
checking the is_removed flag in both of them and by setting the
is_removed flag sooner in pciehp.  However, on macOS, when hot removing
the adapter while audio is playing, a sub-second interruption of
playback is audible, suggesting that it may be impossible to make this
perfectly seamless.

Stacktrace of the soft lockup for posterity:

NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:2:299]
...
Workqueue: pciehp-4 pciehp_power_thread
RIP: 0010:[]  [] tg3_read32+0xd/0x10 [tg3]
...
Call Trace:
 [] ? tg3_stop_block.constprop.126+0x80/0x110 [tg3]
 [] ? tg3_abort_hw+0x68/0x2f0 [tg3]
 [] ? tg3_halt+0x2d/0x180 [tg3]
 [] ? tg3_stop+0x157/0x210 [tg3]
 [] ? tg3_close+0x2b/0xe0 [tg3]
 [] ? __dev_close_many+0x84/0xd0
 [] ? dev_close_many+0x74/0x100
 [] ? rollback_registered_many+0xfb/0x2e0
 [] ? rollback_registered+0x29/0x40
 [] ? unregister_netdevice_queue+0x40/0x90
 [] ? unregister_netdev+0x18/0x20
 [] ? tg3_remove_one+0x8b/0x130 [tg3]
 [] ? pci_device_remove+0x36/0xb0
 [] ? __device_release_driver+0x9a/0x140
 [] ? device_release_driver+0x1e/0x30
 [] ? pci_stop_bus_device+0x84/0xa0
 [] ? pci_stop_bus_device+0x2b/0xa0
 [] ? pci_stop_bus_device+0x2b/0xa0
 [] ? pci_stop_and_remove_bus_device+0xe/0x20
 [] ? pciehp_unconfigure_device+0x9a/0x180
 [] ? pciehp_disable_slot+0x3f/0xb0
 [] ? pciehp_power_thread+0x85/0xa0
 [] ? process_one_work+0x19f/0x3d0
 [] ? worker_thread+0x4d/0x450
 [] ? process_one_work+0x3d0/0x3d0
 [] ? kthread+0xbd/0xe0
 [] ? kthread_create_on_node+0x170/0x170
 [] ? ret_from_fork+0x3f/0x70
 [] ? kthread_create_on_node+0x170/0x170

Cc: Bjorn Helgaas
Cc: Keith Busch
Cc: Andreas Noever
Cc: Siva Reddy Kallam
Cc: Prashant Sreedharan
Cc: Michael Chan
Signed-off-by: Lukas Wunner
---
Depends on Keith Busch's patch "pci: Add is_removed state":
https://patchwork.ozlabs.org/patch/688659/

 include/linux/pci.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index 1f39500..af4178e 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -406,7 +406,7 @@ struct pci_dev *pci_alloc_dev(struct pci_bus *bus);
 
 static inline int pci_channel_offline(struct pci_dev *pdev)
 {
-	return (pdev->error_state != pci_channel_io_normal);
+	return pdev->error_state != pci_channel_io_normal || pdev->is_removed;
 }
 
 static inline int pci_set_removed(struct pci_dev *pdev, void *unused)
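
Illustration (not part of the patch): the effect of the one-line change
on a busy-wait loop can be modelled in userspace.  The sketch below is
not tg3 or kernel code.  Every mock_/MOCK_ name, the poll limit and the
enable bit are assumptions made up for this example.  It relies only on
the fact that reads from a removed PCIe device return all ones, so an
enable bit polled by the driver never appears to clear.

```c
/*
 * Userspace model only.  Every name prefixed mock_/MOCK_ is hypothetical
 * and exists for illustration; none of this is the kernel's or tg3's code.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_ENABLE_BIT	0x2u	/* assumed "block enabled" bit */
#define MOCK_MAX_POLLS	1000u	/* assumed poll limit */

enum mock_channel_state { MOCK_IO_NORMAL = 1, MOCK_IO_FROZEN, MOCK_IO_PERM_FAILURE };
enum mock_result { MOCK_STOPPED, MOCK_OFFLINE, MOCK_TIMEOUT };

struct mock_pci_dev {
	enum mock_channel_state error_state;
	bool is_removed;
	unsigned int reads;	/* number of simulated register reads */
};

typedef int (*mock_offline_fn)(const struct mock_pci_dev *pdev);

/* Predicate as it was before the patch: AER error state only. */
static int mock_offline_before(const struct mock_pci_dev *pdev)
{
	return pdev->error_state != MOCK_IO_NORMAL;
}

/* Predicate as changed by the patch: a removed device counts as offline. */
static int mock_offline_after(const struct mock_pci_dev *pdev)
{
	return pdev->error_state != MOCK_IO_NORMAL || pdev->is_removed;
}

/* A removed device answers every read with all ones; a present one with 0. */
static uint32_t mock_read32(struct mock_pci_dev *pdev)
{
	pdev->reads++;
	return pdev->is_removed ? UINT32_MAX : 0u;
}

/* Poll until the enable bit clears, bailing out once the device is offline. */
static enum mock_result mock_stop_block(struct mock_pci_dev *pdev,
					mock_offline_fn offline)
{
	unsigned int i;

	assert(pdev != NULL && offline != NULL);

	for (i = 0; i < MOCK_MAX_POLLS; i++) {
		if (offline(pdev))
			return MOCK_OFFLINE;
		if (!(mock_read32(pdev) & MOCK_ENABLE_BIT))
			return MOCK_STOPPED;
	}
	return MOCK_TIMEOUT;
}
```

With mock_offline_before, a removed device exhausts every poll before
giving up, which is the shape of the busy-wait described in the commit
message.  With mock_offline_after, the loop returns before touching the
device at all.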