From patchwork Thu Apr 11 15:30:30 2013
Date: Thu, 11 Apr 2013 18:30:30 +0300
From: "Michael S. Tsirkin"
To: Or Gerlitz
Cc: Ming Lei, Greg Kroah-Hartman, David Miller, Roland Dreier, netdev,
	Yan Burman, Jack Morgenstein, Bjorn Helgaas,
	linux-pci@vger.kernel.org, Tejun Heo
Subject: [PATCH repost for-3.9] pci: avoid work_on_cpu for nested SRIOV probes
Message-ID: <20130411153030.GA22743@redhat.com>

The following lockdep warning is reported to trigger since 3.9-rc1:

=============================================
[ INFO: possible recursive locking detected ]
3.9.0-rc1 #96 Not tainted
---------------------------------------------
kworker/0:1/734 is trying to acquire lock:
 ((&wfc.work)){+.+.+.}, at: [] flush_work+0x0/0x250

but task is already holding lock:
 ((&wfc.work)){+.+.+.}, at: [] process_one_work+0x162/0x4c0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock((&wfc.work));
  lock((&wfc.work));

 *** DEADLOCK ***

 May be due to missing lock nesting notation

3 locks held by kworker/0:1/734:
 #0: (events){.+.+.+}, at: [] process_one_work+0x162/0x4c0
 #1: ((&wfc.work)){+.+.+.}, at: [] process_one_work+0x162/0x4c0
 #2: (&__lockdep_no_validate__){......}, at: [] device_attach+0x25/0xb0

stack backtrace:
Pid: 734, comm: kworker/0:1 Not tainted 3.9.0-rc1 #96
Call Trace:
 [] validate_chain+0xdcc/0x11f0
 [] __lock_acquire+0x440/0xc70
 [] ? __lock_acquire+0x440/0xc70
 [] lock_acquire+0x5a/0x70
 [] ? wq_worker_waking_up+0x60/0x60
 [] flush_work+0x45/0x250
 [] ? wq_worker_waking_up+0x60/0x60
 [] ? mark_held_locks+0x9e/0x130
 [] ? queue_work_on+0x46/0x90
 [] ? trace_hardirqs_on_caller+0xfd/0x190
 [] ? trace_hardirqs_on+0xd/0x10
 [] work_on_cpu+0x74/0x90
 [] ? keventd_up+0x20/0x20
 [] ? pci_pm_prepare+0x60/0x60
 [] ? cpumask_next_and+0x23/0x40
 [] pci_device_probe+0xba/0x110
 [] ? driver_sysfs_add+0x7a/0xb0
 [] driver_probe_device+0x8f/0x230
 [] ? __driver_attach+0xb0/0xb0
 [] __device_attach+0x4b/0x60
 [] bus_for_each_drv+0x64/0x90
 [] device_attach+0x98/0xb0
 [] pci_bus_add_device+0x24/0x50
 [] virtfn_add+0x240/0x3e0
 [] ? _raw_spin_unlock_irqrestore+0x3d/0x80
 [] pci_enable_sriov+0x23e/0x500
 [] __mlx4_init_one+0x5da/0xce0 [mlx4_core]
 [] mlx4_init_one+0x2d/0x60 [mlx4_core]
 [] local_pci_probe+0x49/0x80
 [] work_for_cpu_fn+0x13/0x20
 [] process_one_work+0x1c8/0x4c0
 [] ? process_one_work+0x162/0x4c0
 [] worker_thread+0x30b/0x430
 [] ? manage_workers+0x340/0x340
 [] kthread+0xd6/0xe0
 [] ? __init_kthread_worker+0x70/0x70
 [] ret_from_fork+0x7c/0xb0
 [] ? __init_kthread_worker+0x70/0x70

Reference: http://marc.info/?l=linux-netdev&m=136249690901892&w=2

The issue is that a driver, in its probe function, calls
pci_sriov_enable(), so a PF device probe causes a VF probe (a nested
probe).  Each probe in pci_device_probe() is (normally) run through
work_on_cpu() (this is done to get the right NUMA node for memory
allocated by the driver).  In turn, work_on_cpu() does this internally:

	schedule_work_on(cpu, &wfc.work);
	flush_work(&wfc.work);

So if you are running a probe on CPU1 and it causes another probe on
the same CPU, the nested probe will try to flush the workqueue from
inside that same workqueue, which of course deadlocks.
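To make the recursion concrete, here is a minimal sketch of the pattern
the trace shows (hypothetical code for illustration only;
pf_probe_fn/vf_probe_fn are made-up stand-ins for the PF and VF probe
paths, not functions from this patch):

	#include <linux/workqueue.h>
	#include <linux/smp.h>

	/* Stand-in for the nested (VF) probe. */
	static long vf_probe_fn(void *arg)
	{
		return 0;
	}

	/* Stand-in for the outer (PF) probe, which is already running
	 * inside work_on_cpu()'s on-stack work item on this CPU. */
	static long pf_probe_fn(void *arg)
	{
		/*
		 * This nested work_on_cpu() queues a second on-stack
		 * work item of the same lockdep class and then
		 * flush_work()s it from inside the first one: the
		 * recursion reported in the warning above.
		 */
		return work_on_cpu(raw_smp_processor_id(), vf_probe_fn, NULL);
	}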
Nested probing might be tricky to get right generally.  But for
pci_sriov_enable(), the situation is actually very simple: all VFs
naturally have the same node affinity as the PF, and cpumask_any_and()
is actually the same as cpumask_first_and(), so it always gives us the
same CPU.  So let's just detect that case, and run the probing for VFs
locally, without a workqueue.

This is hardly elegant, but it looks to me like an appropriate quick
fix for 3.9.

Tested-by: Or Gerlitz
Signed-off-by: Michael S. Tsirkin
---

Reposting due to missed Cc's. Sorry about the noise.

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 1fa1e48..6eeb5ec 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -286,8 +286,8 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 		int cpu;
 
 		get_online_cpus();
 		cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
-		if (cpu < nr_cpu_ids)
+		if (cpu != raw_smp_processor_id() && cpu < nr_cpu_ids)
 			error = work_on_cpu(cpu, local_pci_probe, &ddi);
 		else
 			error = local_pci_probe(&ddi);
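For completeness, a sketch of why the one-line check is sufficient
(illustrative only; the helper below is made up and not part of the
patch).  Since cpumask_any_and() is defined as cpumask_first_and(), the
CPU picked for a given node is deterministic:

	#include <linux/cpumask.h>
	#include <linux/topology.h>
	#include <linux/smp.h>

	/* Hypothetical helper: would the work_on_cpu() target be the
	 * CPU we are already running on? */
	static bool probe_cpu_is_current(int node)
	{
		int cpu = cpumask_any_and(cpumask_of_node(node),
					  cpu_online_mask);

		/*
		 * A VF sits on the same node as its PF, so for a nested
		 * probe this is the CPU the PF probe was dispatched to,
		 * i.e. the CPU we are on right now.  In that case,
		 * probing locally avoids a recursive work_on_cpu().
		 */
		return cpu < nr_cpu_ids && cpu == raw_smp_processor_id();
	}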