
[PATCHv2,for-3.9] pci: avoid work_on_cpu for nested SRIOV probes

Message ID 20130418200855.GA24086@redhat.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Headers show

Commit Message

Michael S. Tsirkin April 18, 2013, 8:08 p.m. UTC
The following lockdep report triggers since 3.9-rc1:

[ full lockdep report and patch description appear in the Patch section below ]

Comments

Bjorn Helgaas April 18, 2013, 9:40 p.m. UTC | #1
On Thu, Apr 18, 2013 at 2:08 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> The following lockdep report triggers since 3.9-rc1:
>
> =============================================
> [ INFO: possible recursive locking detected ]
> 3.9.0-rc1 #96 Not tainted
> ---------------------------------------------
> kworker/0:1/734 is trying to acquire lock:
>  ((&wfc.work)){+.+.+.}, at: [<ffffffff81066cb0>] flush_work+0x0/0x250
>
> but task is already holding lock:
>  ((&wfc.work)){+.+.+.}, at: [<ffffffff81064352>]
> process_one_work+0x162/0x4c0
>
> other info that might help us debug this:
>  Possible unsafe locking scenario:
>
>        CPU0
>        ----
>   lock((&wfc.work));
>   lock((&wfc.work));
>
>  *** DEADLOCK ***
>
>  May be due to missing lock nesting notation
>
> 3 locks held by kworker/0:1/734:
>  #0:  (events){.+.+.+}, at: [<ffffffff81064352>]
> process_one_work+0x162/0x4c0
>  #1:  ((&wfc.work)){+.+.+.}, at: [<ffffffff81064352>]
> process_one_work+0x162/0x4c0
>  #2:  (&__lockdep_no_validate__){......}, at: [<ffffffff812db225>]
> device_attach+0x25/0xb0
>
> stack backtrace:
> Pid: 734, comm: kworker/0:1 Not tainted 3.9.0-rc1 #96
> Call Trace:
>  [<ffffffff810948ec>] validate_chain+0xdcc/0x11f0
>  [<ffffffff81095150>] __lock_acquire+0x440/0xc70
>  [<ffffffff81095150>] ? __lock_acquire+0x440/0xc70
>  [<ffffffff810959da>] lock_acquire+0x5a/0x70
>  [<ffffffff81066cb0>] ? wq_worker_waking_up+0x60/0x60
>  [<ffffffff81066cf5>] flush_work+0x45/0x250
>  [<ffffffff81066cb0>] ? wq_worker_waking_up+0x60/0x60
>  [<ffffffff810922be>] ? mark_held_locks+0x9e/0x130
>  [<ffffffff81066a96>] ? queue_work_on+0x46/0x90
>  [<ffffffff810925dd>] ? trace_hardirqs_on_caller+0xfd/0x190
>  [<ffffffff8109267d>] ? trace_hardirqs_on+0xd/0x10
>  [<ffffffff81066f74>] work_on_cpu+0x74/0x90
>  [<ffffffff81063820>] ? keventd_up+0x20/0x20
>  [<ffffffff8121fd30>] ? pci_pm_prepare+0x60/0x60
>  [<ffffffff811f9293>] ? cpumask_next_and+0x23/0x40
>  [<ffffffff81220a1a>] pci_device_probe+0xba/0x110
>  [<ffffffff812dadca>] ? driver_sysfs_add+0x7a/0xb0
>  [<ffffffff812daf1f>] driver_probe_device+0x8f/0x230
>  [<ffffffff812db170>] ? __driver_attach+0xb0/0xb0
>  [<ffffffff812db1bb>] __device_attach+0x4b/0x60
>  [<ffffffff812d9314>] bus_for_each_drv+0x64/0x90
>  [<ffffffff812db298>] device_attach+0x98/0xb0
>  [<ffffffff81218474>] pci_bus_add_device+0x24/0x50
>  [<ffffffff81232e80>] virtfn_add+0x240/0x3e0
>  [<ffffffff8146ce3d>] ? _raw_spin_unlock_irqrestore+0x3d/0x80
>  [<ffffffff812333be>] pci_enable_sriov+0x23e/0x500
>  [<ffffffffa011fa1a>] __mlx4_init_one+0x5da/0xce0 [mlx4_core]
>  [<ffffffffa012016d>] mlx4_init_one+0x2d/0x60 [mlx4_core]
>  [<ffffffff8121fd79>] local_pci_probe+0x49/0x80
>  [<ffffffff81063833>] work_for_cpu_fn+0x13/0x20
>  [<ffffffff810643b8>] process_one_work+0x1c8/0x4c0
>  [<ffffffff81064352>] ? process_one_work+0x162/0x4c0
>  [<ffffffff81064cfb>] worker_thread+0x30b/0x430
>  [<ffffffff810649f0>] ? manage_workers+0x340/0x340
>  [<ffffffff8106cea6>] kthread+0xd6/0xe0
>  [<ffffffff8106cdd0>] ? __init_kthread_worker+0x70/0x70
>  [<ffffffff8146daac>] ret_from_fork+0x7c/0xb0
>  [<ffffffff8106cdd0>] ? __init_kthread_worker+0x70/0x70
>
> The issue is that a driver, in its probe function, calls
> pci_sriov_enable, so a PF device probe causes a VF probe (AKA a nested
> probe).  Each probe in pci_device_probe is (normally) run through
> work_on_cpu (this is to get the right NUMA node for memory allocated by
> the driver).  In turn, work_on_cpu does this internally:
>
>         schedule_work_on(cpu, &wfc.work);
>         flush_work(&wfc.work);
>
> So if you are running a probe on CPU1 and it causes another
> probe on the same CPU, this will try to flush the
> workqueue from inside that same workqueue, which causes
> a lockdep warning.
>
> Nested probing might be tricky to get right generally.
>
> But for pci_sriov_enable, the situation is actually very simple: all VFs
> naturally have the same affinity as the PF, and cpumask_any_and is
> actually the same as cpumask_first_and, so it always gives us the same
> CPU.  So let's just detect that, and run the probing for VFs locally
> without a workqueue.
>
> This is hardly elegant, but looks to me like an appropriate quick fix
> for 3.9.
>
> Tested-by: Or Gerlitz <ogerlitz@mellanox.com>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Acked-by: Tejun Heo <tj@kernel.org>

Thanks, Michael.  I put this in my for-linus branch:

http://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/log/?h=for-linus

I'll send a pull request to Linus today.

Bjorn

> ---
>
> Changes from v1:
>     - clarified commit log and added Ack by Tejun Heo
>       patch is unchanged.
>
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 1fa1e48..6eeb5ec 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -286,8 +286,8 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
>                 int cpu;
>
>                 get_online_cpus();
>                 cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
> -               if (cpu < nr_cpu_ids)
> +               if (cpu != raw_smp_processor_id() && cpu < nr_cpu_ids)
>                         error = work_on_cpu(cpu, local_pci_probe, &ddi);
>                 else
>                         error = local_pci_probe(&ddi);
Bjorn Helgaas April 18, 2013, 9:57 p.m. UTC | #2
On Thu, Apr 18, 2013 at 3:40 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Thu, Apr 18, 2013 at 2:08 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> [ full patch description and lockdep report trimmed; quoted in the
>> first reply above ]
>
> Thanks, Michael.  I put this in my for-linus branch:
>
> http://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/log/?h=for-linus
>
> I'll send a pull request to Linus today.

Actually, let me make sure I understand this correctly:  This patch
fixes the lockdep warning, but it does not fix an actual deadlock or
make any functional change.  Is that right?

If that's true, how much pain would it cause to just hold this for
v3.9.1?  I'm nervous about doing a warning fix when we're only a day
or two before releasing v3.9.

Bjorn

Michael S. Tsirkin April 19, 2013, 2:36 p.m. UTC | #3
On Thu, Apr 18, 2013 at 03:57:36PM -0600, Bjorn Helgaas wrote:
> On Thu, Apr 18, 2013 at 3:40 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> > On Thu, Apr 18, 2013 at 2:08 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> [ full patch description and lockdep report trimmed; quoted in the
> >> first reply above ]
> >
> > Thanks, Michael.  I put this in my for-linus branch:
> >
> > http://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/log/?h=for-linus
> >
> > I'll send a pull request to Linus today.
> 
> Actually, let me make sure I understand this correctly:  This patch
> fixes the lockdep warning, but it does not fix an actual deadlock or
> make any functional change.  Is that right?

Tejun said that this warning is a false positive, so yes.

> If that's true, how much pain would it cause to just hold this for
> v3.9.1?  I'm nervous about doing a warning fix when we're only a day
> or two before releasing v3.9.
> 
> Bjorn

I don't have this hardware, so I don't know. It was apparently reported
by real users ...

Bjorn Helgaas April 19, 2013, 4:39 p.m. UTC | #4
On Thu, Apr 18, 2013 at 9:04 PM, Tejun Heo <tj@kernel.org> wrote:
> So, the thing is there is no actual deadlock. If you're okay with releasing
> w/ spurious lockdep warning, leaving things alone should be fine. An issue
> with mst's patch is that it actually changes the behavior to avoid a
> spurious warning. An alternative course would be leaving it alone now and
> remove the spurious warning during the coming devel cycle and mark it w/
> -stable.

If I understand correctly, you need v3.9-rc1 or later,
CONFIG_PROVE_LOCKING=y, and an SR-IOV device to see the warning.

I like the idea of fixing the spurious warning for v3.10 and
backporting to -stable.  It sounds like there's a cleaner fix in the
works that needs a bit more polishing.  If we need a quick fix sooner,
we'll still have this one in our back pocket.

Bjorn
Michael S. Tsirkin April 20, 2013, 7:05 p.m. UTC | #5
On Thu, Apr 18, 2013 at 08:04:47PM -0700, Tejun Heo wrote:
> So, the thing is there is no actual deadlock. If you're okay with releasing w/
> spurious lockdep warning, leaving things alone should be fine. An issue with
> mst's patch is that it actually changes the behavior to avoid a spurious
> warning.

Yes, but in fact, isn't it cleaner to avoid work_on_cpu if we are going
to run on the local CPU anyway?

> An alternative course would be leaving it alone now and remove the
> spurious warning during the coming devel cycle and mark it w/ -stable.
> 
> Thanks.

Okay. Could you send a tested version of work_on_nested?

> --
> tejun

Patch

=============================================
[ INFO: possible recursive locking detected ]
3.9.0-rc1 #96 Not tainted
---------------------------------------------
kworker/0:1/734 is trying to acquire lock:
 ((&wfc.work)){+.+.+.}, at: [<ffffffff81066cb0>] flush_work+0x0/0x250

but task is already holding lock:
 ((&wfc.work)){+.+.+.}, at: [<ffffffff81064352>]
process_one_work+0x162/0x4c0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock((&wfc.work));
  lock((&wfc.work));

 *** DEADLOCK ***

 May be due to missing lock nesting notation

3 locks held by kworker/0:1/734:
 #0:  (events){.+.+.+}, at: [<ffffffff81064352>]
process_one_work+0x162/0x4c0
 #1:  ((&wfc.work)){+.+.+.}, at: [<ffffffff81064352>]
process_one_work+0x162/0x4c0
 #2:  (&__lockdep_no_validate__){......}, at: [<ffffffff812db225>]
device_attach+0x25/0xb0

stack backtrace:
Pid: 734, comm: kworker/0:1 Not tainted 3.9.0-rc1 #96
Call Trace:
 [<ffffffff810948ec>] validate_chain+0xdcc/0x11f0
 [<ffffffff81095150>] __lock_acquire+0x440/0xc70
 [<ffffffff81095150>] ? __lock_acquire+0x440/0xc70
 [<ffffffff810959da>] lock_acquire+0x5a/0x70
 [<ffffffff81066cb0>] ? wq_worker_waking_up+0x60/0x60
 [<ffffffff81066cf5>] flush_work+0x45/0x250
 [<ffffffff81066cb0>] ? wq_worker_waking_up+0x60/0x60
 [<ffffffff810922be>] ? mark_held_locks+0x9e/0x130
 [<ffffffff81066a96>] ? queue_work_on+0x46/0x90
 [<ffffffff810925dd>] ? trace_hardirqs_on_caller+0xfd/0x190
 [<ffffffff8109267d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff81066f74>] work_on_cpu+0x74/0x90
 [<ffffffff81063820>] ? keventd_up+0x20/0x20
 [<ffffffff8121fd30>] ? pci_pm_prepare+0x60/0x60
 [<ffffffff811f9293>] ? cpumask_next_and+0x23/0x40
 [<ffffffff81220a1a>] pci_device_probe+0xba/0x110
 [<ffffffff812dadca>] ? driver_sysfs_add+0x7a/0xb0
 [<ffffffff812daf1f>] driver_probe_device+0x8f/0x230
 [<ffffffff812db170>] ? __driver_attach+0xb0/0xb0
 [<ffffffff812db1bb>] __device_attach+0x4b/0x60
 [<ffffffff812d9314>] bus_for_each_drv+0x64/0x90
 [<ffffffff812db298>] device_attach+0x98/0xb0
 [<ffffffff81218474>] pci_bus_add_device+0x24/0x50
 [<ffffffff81232e80>] virtfn_add+0x240/0x3e0
 [<ffffffff8146ce3d>] ? _raw_spin_unlock_irqrestore+0x3d/0x80
 [<ffffffff812333be>] pci_enable_sriov+0x23e/0x500
 [<ffffffffa011fa1a>] __mlx4_init_one+0x5da/0xce0 [mlx4_core]
 [<ffffffffa012016d>] mlx4_init_one+0x2d/0x60 [mlx4_core]
 [<ffffffff8121fd79>] local_pci_probe+0x49/0x80
 [<ffffffff81063833>] work_for_cpu_fn+0x13/0x20
 [<ffffffff810643b8>] process_one_work+0x1c8/0x4c0
 [<ffffffff81064352>] ? process_one_work+0x162/0x4c0
 [<ffffffff81064cfb>] worker_thread+0x30b/0x430
 [<ffffffff810649f0>] ? manage_workers+0x340/0x340
 [<ffffffff8106cea6>] kthread+0xd6/0xe0
 [<ffffffff8106cdd0>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff8146daac>] ret_from_fork+0x7c/0xb0
 [<ffffffff8106cdd0>] ? __init_kthread_worker+0x70/0x70

The issue is that a driver, in its probe function, calls
pci_sriov_enable, so a PF device probe causes a VF probe (AKA a nested
probe).  Each probe in pci_device_probe is (normally) run through
work_on_cpu (this is to get the right NUMA node for memory allocated by
the driver).  In turn, work_on_cpu does this internally:

        schedule_work_on(cpu, &wfc.work);
        flush_work(&wfc.work);

So if you are running a probe on CPU1 and it causes another
probe on the same CPU, this will try to flush the
workqueue from inside that same workqueue, which causes
a lockdep warning.
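
For reference, work_on_cpu() itself is roughly the following in v3.9 (a
simplified sketch, not a verbatim copy).  The on-stack work item is
initialized at a single INIT_WORK_ONSTACK() call site, so every
work_on_cpu() invocation shares one lockdep class, and flushing the
nested item from inside work_for_cpu_fn() looks to lockdep like taking
the same lock twice:

	struct work_for_cpu {
		struct work_struct work;
		long (*fn)(void *);
		void *arg;
		long ret;
	};

	static void work_for_cpu_fn(struct work_struct *work)
	{
		struct work_for_cpu *wfc = container_of(work, struct work_for_cpu, work);

		wfc->ret = wfc->fn(wfc->arg);
	}

	long work_on_cpu(int cpu, long (*fn)(void *), void *arg)
	{
		struct work_for_cpu wfc = { .fn = fn, .arg = arg };

		INIT_WORK_ONSTACK(&wfc.work, work_for_cpu_fn);
		schedule_work_on(cpu, &wfc.work);
		flush_work(&wfc.work);	/* a nested probe flushes from inside work_for_cpu_fn() */
		return wfc.ret;
	}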

Nested probing might be tricky to get right generally.

But for pci_sriov_enable, the situation is actually very simple: all VFs
naturally have the same affinity as the PF, and cpumask_any_and is
actually the same as cpumask_first_and, so it always gives us the same
CPU.  So let's just detect that, and run the probing for VFs locally
without a workqueue.
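
To make the "same CPU" point concrete: include/linux/cpumask.h defines
cpumask_any_and() as cpumask_first_and(), so the choice below is
deterministic.  pick_probe_cpu() is just an illustrative helper for this
discussion, not code from the patch:

	#include <linux/cpumask.h>
	#include <linux/pci.h>
	#include <linux/topology.h>

	/*
	 * The PF and its VFs report the same NUMA node (assumed >= 0
	 * here, as pci_call_probe() checks), so a nested VF probe picks
	 * exactly the CPU the PF probe is already running on -- which
	 * is what the raw_smp_processor_id() check in the patch detects.
	 */
	static int pick_probe_cpu(struct pci_dev *dev)
	{
		int node = dev_to_node(&dev->dev);

		return cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
	}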

This is hardly elegant, but looks to me like an appropriate quick fix
for 3.9.

Tested-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>

---

Changes from v1:
    - clarified commit log and added Ack by Tejun Heo
      patch is unchanged.

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 1fa1e48..6eeb5ec 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -286,8 +286,8 @@  static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 		int cpu;
 
 		get_online_cpus();
 		cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
-		if (cpu < nr_cpu_ids)
+		if (cpu != raw_smp_processor_id() && cpu < nr_cpu_ids)
 			error = work_on_cpu(cpu, local_pci_probe, &ddi);
 		else
 			error = local_pci_probe(&ddi);
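
For context, with the hunk above applied, the surrounding function reads
roughly as follows.  This is a reconstruction of the v3.9 pci_call_probe()
for illustration only: the lines shown in the hunk are as posted, the rest
is recalled context and may differ in detail.  struct drv_dev_and_id and
local_pci_probe() are defined earlier in pci-driver.c.

	static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
				  const struct pci_device_id *id)
	{
		int error, node;
		struct drv_dev_and_id ddi = { drv, dev, id };

		/*
		 * Run the probe on a CPU of the node the device's bus is
		 * attached to, so memory the driver allocates during probe
		 * lands on the right NUMA node.
		 */
		node = dev_to_node(&dev->dev);
		if (node >= 0) {
			int cpu;

			get_online_cpus();
			cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
			if (cpu != raw_smp_processor_id() && cpu < nr_cpu_ids)
				error = work_on_cpu(cpu, local_pci_probe, &ddi);
			else
				error = local_pci_probe(&ddi);
			put_online_cpus();
		} else
			error = local_pci_probe(&ddi);
		return error;
	}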